From the Editor-in-Chief




Bioengineering applications 


Before we begin to discuss the articles in this issue, I’d like to bring to your attention two items of news that will positively influence the coming issues of IEEE Micro. Priscilla Lu, from AT&T, starts her appointment with our editorial board this month; welcome! Priscilla’s biography and photograph appear in Micro News this issue. In addition, we can now provide a more-detailed editorial plan for 1992. Please turn to our back cover for this information.

Now let’s turn to this issue, which was initially planned to be devoted to computer applications in biology and medicine and to bioengineering applications. We can offer a couple of articles related to this theme; they may open up new ideas for engineers and researchers beyond the classic territory of microelectronics.

The first article, “Simulating a Function of Visual Peripheral Processes with an Analog VLSI Network” by H. Li and C. Chen, describes how one can design an integrated motion detector in a single piece of silicon. Motion detection is currently on the list of “hot” application problems. In this case, note the design method, which relies on a detailed mathematical formulation of the problem. The subject of this article inspired our front cover.

Then we go directly to the core of the issue theme. To pick up useful electrical signals from muscles, researchers have developed ad hoc techniques that aim both at reducing noise and at limiting discomfort for the subject being measured (usually a person). These techniques, combined with technological developments, have allowed electromyography to move from research labs into current clinical practice. The second article, “Computer Analysis of the Myoelectric Signal” by M. Knaflitz and G. Balestra, is a structured overview of the problems and solutions that make up this area of work for biomedical engineers. A sequence of informational boxes builds a detailed tutorial introduction for those not familiar with handling biological signals.

After these tours near the boundary, we move to the remaining three articles, all of which are well inside the core of Micro's area of interest. These three articles continue the flow of advanced processing architectures presented in the “Hot Chips” and “Far East” issues. “Alphorn, a Remote Procedure Call Environment for Fault-Tolerant, Heterogeneous, Distributed Systems” by H.R. Aschmann and others presents a complete environment for programming, configuration, and communication in real-time distributed processing systems.

Next we dive deeply into hardware. As 2-million-transistor chips become feasible, relatively small processors (which just a few years before could hardly be squeezed onto one chip) become library parts for the design of complex silicon systems. “A RISC Processor for Embedded Applications Within an ASIC” by C.E. Roberts is one further step in this direction. I would warn all readers: Watch Micro's pages carefully, and you will see an article in the near future titled something like “A Superscalar Processor as a Basic Cell for Super ASICs...!”

The articles in this issue end with “Performance and the i860 Microprocessor” by M. Atkins. This article describes in detail the design solutions that allow this processor to achieve such impressive performance and announces the next processor of the family (one with 2.5M transistors). For an introductory tour of the i860 architecture and capabilities, you may want to start with “Introducing the Intel i860” in the August 1989 issue of Micro.








In the mailbag 


(LK: liked, DLK: disliked, LTS: like
to see)

October 1989 

LK: Micro World, Micro Law; please keep them going; LTS: More on Europe, like Micro World.—N.B. (We will be able to keep these columns in our magazine, thanks to the good work of columnists Hubert Kirrmann and Richard Stern.—D.D.C.)

February 1991 

DLK: The total confusion over what article went where!!! Let’s go back to a simple straightforward layout—each article in one place! I just got too lost.—R.N.C., Ryde, Australia (This problem still seems to be hot. Please continue sending your comments; anybody in favor of our current arrangement?—D.D.C.)

April 1991 

LK: This is my first issue.... I liked it very much.—C.H., Helsinor, Denmark (And I am sure you will appreciate the next ones too!—D.D.C.)

DLK: Splitting articles—G.G., 
Fairfield, VA (See above.—D.D.C.) 

LK: Micro Law, Micro View, New Products; DLK: Most articles are too technical (mathematical). They belong in transactions! LTS: Any hope of a special issue on work in the developing countries?—M.T., Ikeja, Nigeria (Your first point is again credit for the respective columnists, Stern, Slater, and Hootman. Unfortunately, we had to discontinue Micro View for a while; but you will see it again in another form. Let me disagree with your second point: Our work demands something more than just a little mathematics, and the articles published in Micro substantially differ from the ones appearing in the various transactions. On your final point: Why not? All we need is a proposal and a guest editor. If any readers are interested, or know of someone who is, please contact either me or Ashis Kahn at the addresses listed on page 1 in this issue.—D.D.C.)

LK: Optical computations—A.N., 
Teheran, Iran 


June 1991 

DLK: Splitting articles between 
front and back of the magazine— 
S.N.K., Laurel, MO 


Coming in 
December 

Look for articles on 
database machines 
in the next issue of 
IEEE Micro 

• VLSI accelerators for large database systems, from Bellcore

• An associative accelerator for large databases, from Laboratoire MASI in France

• RINDA: A relational database processor with hardware specialized for searching and sorting, from NTT in Japan

• A fine-grain architecture for relational database aggregation operations, from George Mason University

• A parallel, scalable, and microprocessor-based database computer for performance gains and capacity growth, from Naval Postgraduate School


Expedited delivery 

is available to all members residing 
outside the USA, Canada, and Mexico. 
We invite you to take advantage of 
this service providing delivery of your 
magazine weeks earlier. 

♦ 

For information on this service and 
its cost, contact: 

Expedited Delivery 
IEEE Computer Society 
10662 Los Vaqueros Circle 
PO Box 3014 

Los Alamitos, CA 90720-1264 
USA 

Phone: (714) 821-8380 
Fax: (714) 821-4010 
Review 



Richard Mateosian 

2919 Forest Avenue 
Berkeley, CA 
94705-1310 
(510) 540-7745 


Computers and creativity 


Last issue I reviewed The Emperor’s New Mind by Roger Penrose, a book that is openly hostile to the view that computers can emulate people. The book that I’ve reviewed this time takes a contrasting position.

The Creative Mind—Myths and Mechanisms, 

Margaret A. Boden (Basic Books, a division of 
Harper Collins, New York, 1990, 316 pp., $24.93) 
Margaret Boden explores creativity from a scientific viewpoint. She looks at some impressive computer programs to see what insights they give us into the mechanisms of creativity. She also looks at myths and prejudices about using computers to understand or emulate human creativity.

Boden approaches creativity from the viewpoint of the computational psychologist. She is interested in the thought processes and mental structures that support creative behavior. Before she can begin, however, she must deal with two popularly held views that are antagonistic to any scientific approach to creativity. The inspirational view is like the picture of Mozart in the movie Amadeus, namely that creativity arises from a mysterious or divine spark that no amount of hard work or virtue can emulate. Thus the childish, boorish Mozart makes heavenly music while the hard-working Salieri’s compositions never rise above the level of charming mediocrity. The romantic view is similar but less extreme. It holds that creative people have an innate talent (insight or intuition) that others lack. Some people are born creative, and the rest can only watch and admire them.

To Boden, inspiration and insight or intuition are not answers but merely ill-defined questions. She wants to know how people go about thinking new thoughts, and she looks to the field of artificial intelligence for ideas that will help us to understand this.

She poses and attempts to answer four 
questions: 

• Can computational ideas help us to understand how human creativity is possible?

• Can computers ever do anything that appears to be creative?

• Can computers ever appear to recognize 
creativity, for example, in a poem written 
by a human? 

• Can computers ever really be creative? 

She calls these her four Lovelace questions, 
after Ada, Lady Lovelace, who first propounded 
the argument that many people would use in 
answering no to all of them: Computers cannot 
create, because they can only do what they are 
programmed to do. Boden answers yes to the 
first three Lovelace questions and probably not 
to the fourth, which she considers to be the least 
interesting for her purposes. 

To proceed, Boden must first define creativity. She begins by reviewing what creative people like the chemist Kekule and the poet Coleridge have said about the sources of their ideas. Kekule conceived the idea of the benzene ring while dozing in front of the fire and watching imaginary molecular strings and snakes gamboling in the flames. Coleridge devised Kubla Khan in a similar reverie. In the course of the book Boden returns to these examples and shows how mechanisms that Kekule and Coleridge did not report were actually at work in the forming of their creations.

In defining creativity Boden distinguishes psychological creativity (P-creativity) from historical creativity (H-creativity). If Mary Smith has an idea that she could not have had before, and she recognizes its significance, the idea is P-creative. If no one has ever had that idea before, it is also H-creative. For the purpose of scientific analysis, Boden considers only P-creativity.

Having defined creativity, Boden begins to look at its mechanisms. She postulates that people build mental maps of conceptual spaces (for example, quantum theory in physics or impressionism in art). One mechanism of creativity is to apply some heuristic (for example, consider the negative) to change or extend the map. Thus Kekule’s introduction of ring structure extended the conceptual space, which up to Kekule’s time only included linear strings of carbon atoms. Boden sees AI as an ideal discipline for representing and exploring conceptual spaces.

Some of the techniques of AI are generative systems and heuristics. Generative systems allow systematic enumeration of points of the conceptual space. Heuristics are a priori considerations that allow searches of the conceptual space to be restricted to promising candidates. For example, a program trying to determine the home key of a fugue can use the very first note to suggest the four most likely candidates; one of these four will be the home key in almost all cases.

Representation is important in human thinking and in computer programs. The change from the roman to the arabic numbering system made possible huge advances in computation and ultimately in mathematics. Scripts, frames, and semantic nets are representations of knowledge commonly used in AI programs. Representations can assist creativity, but they can also inhibit it. For example, drawing a diagram can make it difficult to solve the following easy problem:

There are two houses x feet apart. A 20-foot string is suspended between them. The points, A and B, at which the string is attached are at an equal distance above the ground. This distance is high enough to allow the string to hang freely without touching the ground. The vertical sag in the string is 10 feet. What is the value of x?

No book dealing with current AI 
techniques can ignore neural nets. 
Boden shows how parallel distributed 
processing can be used to construct 
self-organizing systems that learn to 
recognize patterns and associate ideas 
in ways that are externally similar to 
what humans do. For example, she 
describes a system that learns the past 
tenses of English verbs, making the 
same pattern of errors that children 
make in the course of learning. 

Boden considers some successful AI programs, because she believes that they shed light on the way human creativity works. For example, Harold Cohen, an established artist, has produced a program called Aaron, which produces drawings that many human artists would be proud of. Aaron contains general knowledge about composition and drawing techniques and specific knowledge about the kinds of subjects it draws. Its internal representations allow it to vary its drawings indefinitely within a set of well-defined constraints that define its “style.”

Boden also considers a jazz improvisation program that models the styles of Charlie Parker and Dizzy Gillespie. Noting that jazz pianists make up these improvisations as they go along, the programmer has found algorithms that require little computing or memory to execute. The programmer claims that the program performs as well as a moderately competent beginner.

Even more interesting to me is a program called Acme, which recognizes analogies. Acme can understand and explain Socrates’ characterization of himself as a midwife of ideas. Acme uses a connectionist system called Word Net as its dictionary. The entries in Word Net are concepts, and the links between them are relationships like synonym, antonym, subordinate, and part.

In addition to the above examples of artificial intelligence programs emulating artistic creativity, Boden also gives numerous examples of programs in the scientific field. She describes expert systems like Dendral, the well-known molecular analysis program, and a program that successfully diagnoses soybean diseases. These programs use induction to revise and improve their decision making. Thus, the soybean program diagnosed correctly only 83 percent of cases using its initial set of rules, but after learning from a large set of examples it improved its percentage to nearly 100 percent.

Abstract Mathematician is a program that starts with 100 primitive mathematical concepts and about 300 heuristics and looks for interesting patterns. This program has discovered several new theorems and a variety of well-known ideas like the famous Goldbach conjecture (every even number greater than 2 is the sum of two primes).
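The conjecture is easy to test empirically for small numbers. The following is only an illustrative sketch, not anything taken from the program described in the book; the function names `is_prime` and `goldbach_pair` are hypothetical.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; adequate for small n."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def goldbach_pair(n: int):
    """Return one pair of primes (p, q) with p <= q and p + q == n, or None."""
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return (p, n - p)
    return None


# Every even number from 4 to 1000 has at least one such decomposition.
assert all(goldbach_pair(n) is not None for n in range(4, 1001, 2))
print(goldbach_pair(28))   # (5, 23)
print(goldbach_pair(100))  # (3, 97)
```

A check like this only gathers evidence for small cases; it proves nothing about the conjecture itself.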

Having led the reader through a number of quite impressive examples, Boden returns to the Lovelace questions and explains her answers to them. To do this, she must first tear down another straw man. To counter the argument that unpredictability is the essence of creativity, she takes apart and analyzes the concepts of chance, chaos, and randomness. Then she points out that prediction is not what science is all about, but only a side effect of its real goal: To explain how it is possible for things to happen as they do.

One of the most important points that Boden makes in her demystification of creativity is that its mechanisms are available to everyone. She first asks the reader to recount all the thoughts that occur during an attempt to complete a limerick with a given first line. Then she says to try the exercise again, thinking out loud. In the second case, the mental processes are much more apparent. The point is that we don’t normally remember all of the things that go on inside our heads, and neither do people like Coleridge or Kekule who report on the mystical ways that ideas come to them. In Boden’s view creativity entails abilities that we all have: noticing, remembering, seeing, speaking, hearing, understanding language, understanding analogies. Mozart may have been better at some of these things than the rest of us, but he belonged to the same genus and species.

The contention that creativity proceeds by mechanisms available to all of us is essential to Boden’s belief in her positive answers to the Lovelace questions. To her, the rejection of machines as potential advisers, companions, entertainers, and so on smacks of human chauvinism rather than a provable difference between what we do and what they might someday be capable of doing.

I picked up this book, and I read it every available moment until I finished it. Margaret Boden writes creatively about creativity. She writes about mental associations and allusions in a style filled with associations and allusions. Time and again she reintroduces old examples in surprising new contexts or simply turns a very clever phrase.

Boden takes examples from many 
fields of art and science, and she seems 
at home in all of them. In one case, 
for example, having referred much 
earlier to Shakespearean sonnets, she 
is talking about programs that can find 
rhymes. She mentions that the program 
might even be able to see love/prove 
as a rhyme. This statement stands by 
itself, but some readers will realize that 
it is an allusion to the fact that no less 
than four of Shakespeare’s sonnets use 
that rhyme in their closing couplets. 

I strongly recommend this book to 
anyone interested in creativity and the 
field of artificial intelligence research. 


Micro News


Organic compounds: Future 
electrical materials? 

New, lower priced batteries, semiconductors, and other electrical devices capable of being produced at much lower temperatures may one day prove possible if current materials research succeeds.

Chemists at last February’s meeting of the American Association for the Advancement of Science reported findings of interesting electrical, optical, or magnetic properties in a series of synthesized compounds. These one-dimensional materials also differ from metals in resistance and properties. One salt known as TTeF-TCNQ uses tellurium to improve its electrical properties. This material remains metallic to temperatures as low as 1.5 kelvin.

Dwaine O. Cowan, Theodore Poehler, and Thomas Kistenmacher of Johns Hopkins University have demonstrated the possibility of constructing organic solids that exhibit either metallic or superconducting properties under the proper conditions. Though the compounds are difficult to make, several groups in the US, France, Japan, and Denmark report efforts designed to exploit the potential benefits of the compounds.

Cowan’s group created the first organic metal in 1972, and former Hopkins postdoctoral fellow Klaus Bechgaard created the first organic superconductor 10 years ago.



Send information for inclusion in Micro News 
one month before cover date to Managing 
Editor, IEEE Micro, PO Box 3014, Los Alamitos, 
CA 90720-1264. 


Beyond the stethoscope 

“Sounds made by normal and defective hearts frequently differ so subtly that only a highly trained heart specialist with years of experience can tell the difference.” Though this statement from Yale University physicist William R. Bennett, Jr., may sound obvious, it pinpoints the reasons Bennett’s diagnostic invention has proved so helpful.

Though heart defects produce distinctive sounds, most physicians find them hard to recognize because they may see only a few patients in a lifetime with certain heart defects. Bennett’s dynamic spectral phonocardiograph helps physicians by greatly amplifying sound differences; it is much more sensitive than a stethoscope and works successfully in noisy hospital environments.

Using a Macintosh IIci, Bennett created a software program that first converts heart sounds into numbers and then into images via a fast Fourier transform. As shown in the figure, a printout of the sounds displays loud beats as jagged peaks and softer murmurs as shallow squiggles. A color monitor lets users distinguish red normal heart sounds from yellow abnormal sounds. “The phonocardiograph will enable physicians to watch the computer image of the heartbeat at the same time they are listening to the sound,” says Bennett, who is the C. Baldwin Sawyer Professor of Engineering and Applied Science at Yale.
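The step from digitized sound to a frequency picture can be sketched in a few lines. This is not Bennett's software: it is a minimal illustration that splits a signal into frames and transforms each one, using a naive discrete Fourier transform in place of a fast Fourier transform (an FFT gives the same values, only faster). The signal, sampling rate, frame length, and function names are all made up for the example.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of the discrete Fourier transform for bins 0..N/2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def spectrogram(samples, frame_len):
    """Split the signal into non-overlapping frames and transform each one."""
    return [dft_magnitudes(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


# Synthetic stand-in for a digitized heart sound: a 50 Hz tone burst
# followed by silence, sampled at 800 Hz (all values are illustrative).
RATE, FRAME = 800, 64
signal = [math.sin(2 * math.pi * 50 * t / RATE) for t in range(FRAME)] + [0.0] * FRAME
rows = spectrogram(signal, FRAME)
peak_bin = max(range(len(rows[0])), key=lambda k: rows[0][k])
print(peak_bin * RATE / FRAME)  # frequency (Hz) of the strongest bin in frame 0
```

Each row of `rows` is one time slice; plotting the rows side by side, with magnitude mapped to height or color, would give an image of the kind the news item describes.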

Bennett hopes his invention will help 
physicians avoid unnecessary surgery 
or pick the best time for surgery. Reeves 
Scientific Inc. of New Haven builds 
prototypes of the phonocardiograph for 
testing in 52 US teaching hospitals. 

In addition to the heart monitor, Bennett’s interest in sound waves has also spawned a design for a hearing aid that uses a chip to provide more accurate sound and less background noise than existing hearing aids. Though still under development, his hearing aid was patented last year.

In 1960 while at Bell Laboratories, 
Bennett coinvented the first gas laser, 
and in 1964 as a Yale faculty member, 
he invented the argon ion laser. 

The challenge of open buses 

Why do we need open buses? Attendees at the Open Bus Systems 91 conference in Paris this November 26-27 will hear Paul Borrill answer that very question. Other talks include David Wilner discussing new techniques for improving reliability of complex life-critical computer systems, and Cristopher Eck reporting on the VMEbus in the 90s. David Gustavson’s invited talk will focus on the Scalable Coherent Interface and new, related standards projects. Session topics include industrial applications, multiprocessor architectures, Futurebus+, real-time software, conformance testing, image processing, SCI, and board design.

For information, write to Open Bus Systems 91, VITA Europe, PO Box 192, NL-5300 AD Zaltbommel, The Netherlands; telephone at +31 4180 14661 or fax to +31 4180 15115.

Editorial board expands 

Editor-in-Chief Dante Del Corso has appointed Priscilla Lu, director of imaging systems at AT&T Computer Systems, to the editorial board of IEEE Micro. Lu will review manuscripts for the magazine.

Lu joined AT&T as first supervisor of the microprocessor architecture group, became department head of the Microsystems Engineering Department, and later moved to the position of department head of the Network and Systems Management organization for the Star Group product line.

Lu holds a PhD in computer science and electrical engineering from Northwestern University and an MS and BS in computer science and mathematics from the University of Wisconsin, Madison.
Phonocardiogram printout of a normal heartbeat producing two sounds, S1 and S2, followed by a silence represented as straight lines (top); and an abnormal heartbeat producing a murmur displayed as wavy lines (bottom).































Simulating a Function of Visual 
Peripheral Processes with an 
Analog VLSI Network 


Using analog VLSI technology to implement algorithms that mimic human visual perception opens up a new avenue to approach machine intelligence. Our 2D network simulates a function of motion perception in the visual peripheral process, carrying out computations with analog voltage and current. SPICE simulations and experiments using real images matched the results of the theoretical analysis.


Hua Li 

Ching-Ho Chen 

Texas Tech University 




General experience leads us to believe that the human visual perception of motion consists of two processes: peripheral, which detects motion at a low level, and attentive, which interprets the motion. 1,2 The peripheral process primarily directs the focus of attention. It is important in guiding one’s reaction, orienting responses and locomotion, and usually serves as the first stage for cognition. For example, we first detect moving objects and record their rough locations and sizes. Then, we decide whether further high-level processing is necessary. 3 The peripheral process is presumably spatially parallel so that the entire visual field is processed quickly.

Computationally, we can model the peripheral process by noting the differences between two consecutive images. Jain recorded some interesting work on the mathematical formulation of the process. 4 He developed a difference-picture technique for dynamic image processing. The crucial part of simulating the peripheral process is its processing speed. It must be fast enough to allow an intelligent machine or robot to interact with the instantaneously changing environment.

Using analog VLSI (very large scale integration) technology to implement algorithms that mimic human visual perception gives us new insights into machine intelligence. Mahowald and Mead have constructed a silicon retina. 5 Hutchinson, Koch, Luo, and Mead have built a resistive network to compute optical flow. 6 Lumsdaine, Wyatt, and Elfadel have worked on nonlinear analog networks for image smoothing and segmentation. 7 Recent reports describe silicon models that simulate human saccadic eye movement 8 and vertebrate retinal processing. 9 One common feature of these works is that expensive numerical computations are performed in analog form at very high speed.

We describe a two-dimensional analog VLSI network for implementing a function of motion perception during the peripheral process. The network contains many nodes, or processing elements (PEs), which operate concurrently in asynchronous fashion. Each PE performs an image-difference operation as well as thresholding to extract motion information on a pixel-by-pixel basis. Analog voltage, which interacts extremely fast, implements all computations. An energy equilibrium state of the network provides stabilized analog voltage across the network, which is the solution to the problem. The structure of each simple PE has nice regularity.

The difference-picture technique 

Implementing vision to build an intelligent machine is a most exciting and difficult task. Human beings perceive their environment effortlessly in comparison to today’s sophisticated computer vision systems. Consequently, we should take this suggestion from nature and find an alternative way to design image processing algorithms. Many researchers have explored the feasibility of developing computational models to imitate human visual perception. 10-12 It is well known in the visual sciences that early motion perception uses a mechanism that can be described as a delayed comparison scheme. 13 The difference-picture technique relies on this scheme. With some modifications, researchers have recently applied the technique to design motion-compensation algorithms and circuits for applications in the high-definition television industry. 14

Given an image $E(x,y,t)$ as a function of both location $(x,y)$ and time $t$, we assume the partial derivatives $\partial E(x,y,t)/\partial x$, $\partial E(x,y,t)/\partial y$, and $\partial E(x,y,t)/\partial t$ exist. Generally speaking, $\partial E(x,y,t)/\partial x$ and $\partial E(x,y,t)/\partial y$ are related to the image edge segments, while $\partial E(x,y,t)/\partial t$ relates to time. We calculate $\partial E(x,y,t)/\partial t$ as follows:

$$\frac{\partial E(x,y,t)}{\partial t} \cong \frac{E(x,y,t) - E(x,y,t-\Delta t)}{\Delta t} \tag{1}$$

where $x = 0, 1, \ldots, M$ and $y = 0, 1, \ldots, N$ for $M \times N$ images, with $\Delta t$ equaling one time unit or sampling interval. For nonstationary images, we can use Equation 1 to define secondary parameters such as $DP(x,y,t)$, $D_1(x,y,t)$, and $D_2(x,y,t)$ to describe motion.

Definition 1. We define a difference picture $DP(x,y,t)$ as a binary image. Depending on the threshold $T$ of the image difference $E(x,y,t) - E(x,y,t-dt)$, each of its pixels has two possible gray levels $L_l$ and $L_h$. That is

$$DP(x,y,t) = \begin{cases} L_h & \text{if } E(x,y,t) - E(x,y,t-dt) > T \\ L_l & \text{otherwise.} \end{cases} \tag{2}$$

Here $L_h = 255$ and $L_l = 0$. $T$ is chosen heuristically and is set high whenever the images are badly corrupted by noise. We obtain $DP(x,y,t)$ based on the pixel-by-pixel determination of significant changes of image intensity, with its nonzero regions corresponding to the nonstationary scene.
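A software rendering of Definition 1 may help fix ideas before the hardware realization. The sketch below is only an illustration, not the authors' implementation: it applies Equation 2 to two small frames stored as lists of rows, using the signed difference exactly as the definition states it. The function name `difference_picture` and the sample frames are hypothetical.

```python
L_H, L_L = 255, 0  # gray levels for "changed" and "unchanged" pixels


def difference_picture(current, previous, threshold):
    """Apply Equation 2 pixel by pixel to two equally sized gray-level frames."""
    return [
        [L_H if cur - prev > threshold else L_L
         for cur, prev in zip(cur_row, prev_row)]
        for cur_row, prev_row in zip(current, previous)
    ]


# Made-up 3x4 frames: a bright patch appears in the middle columns,
# and two pixels change slightly because of noise.
previous = [[10, 10, 10, 10],
            [10, 10, 10, 10],
            [10, 10, 10, 10]]
current = [[10, 10, 10, 10],
           [12, 200, 180, 10],
           [10, 10, 10, 9]]

dp = difference_picture(current, previous, threshold=30)
for row in dp:
    print(row)
```

With the threshold at 30, the small noise-like changes are suppressed and only the two pixels of the bright patch are marked with gray level 255.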

We further decompose a difference picture $DP(x,y,t)$ into two subsets, $D_1(x,y,t)$ and $D_2(x,y,t)$, that correspond to the nonstationary regions of images $E(x,y,t)$ and $E(x,y,t-dt)$.

Definition 2. We define subdifference pictures $D_1(x,y,t)$ and $D_2(x,y,t)$ as subsets of $DP(x,y,t)$, such that $D_1(x,y,t) \subset E(x,y,t)$, $D_2(x,y,t) \subset E(x,y,t-dt)$, and

$$DP(x,y,t) = D_1(x,y,t) \cup D_2(x,y,t) \tag{3}$$

as well as

$$D_1(x,y,t) \cap D_2(x,y,t-dt) = \emptyset. \tag{4}$$


The relative position between $D_1(x,y,t)$ and $D_2(x,y,t)$ describes the perceived motion direction, the size of each nonstationary region, as well as its location. With this information, higher level cognitive processes can be initiated. Computationally, a fast and accurate algorithm simulating a peripheral process can serve as a preprocessing stage, which can reduce the search space and improve computational speed and efficiency. 3,4


Design methodology 

We can realize the difference-picture technique by a 2D analog VLSI network. Such hardware implementation provides very fast computational speed. This network contains $M \times N$ nodes, or PEs, with each corresponding to one image pixel. Each PE has two functions: computation of the image difference $E(x,y,t) - E(x,y,t-dt)$ and thresholding the difference to derive a difference picture $DP(x,y,t)$. We can alter the overall network either through programming or electronically. This capability allows us to work with a large class of motion-detection problems.

Modifiable thresholding unit. To perform analog computing, we first map each gray level of image intensity to analog voltage.

Definition 3. An analog voltage $V_i$ such that $\{V_i \mid 0 \le V_i \le V_{dd},\ i = 0, 1, \ldots, 255\}$, satisfying the equation of the form

$$V_i = \frac{V_{dd}\,(L_i - L_l)}{L_h - L_l} \tag{5}$$

is said to be a mapping voltage, where $L_h$ is the largest gray level of image intensity and $L_l$ is the smallest one.

The difference-picture technique gives $T$ in terms of the gray level of image intensity, which must be converted to analog voltage to provide a specification for hardware design. In the following, unless otherwise stated, we refer to a threshold voltage as a mapping voltage of $T$. The following theorem describes the mapping.

Theorem 1. If we choose $T$ as a gray-level threshold for a given image and if we use the mapping in Definition 3, the analog threshold voltage $V_{in}$ equals

$$V_{in} = \frac{T\,V_{dd}}{L_{max}} \tag{6}$$

where $0 \le T \le 255$ and $L_{max}$ is the maximum gray level of image intensity. In our case it equals 255. $V_{dd}$ is a standard 5V power supply.


Proof. Following Definition 3, let

$$V_{in} = V_i, \tag{7}$$

$$L_{max} = L_h - L_l, \tag{8}$$

$$T = L_i - L_l. \tag{9}$$


Then, substituting them into Equation 5 gives us Equation 6. 
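As a numeric check of Equation 6, the mapping from a gray-level threshold to a threshold voltage can be written as a one-line function. This is a sketch under the article's stated values (a 5V supply and a maximum gray level of 255); the function name `threshold_voltage` is hypothetical.

```python
V_DD = 5.0    # supply voltage in volts (the article's 5V supply)
L_MAX = 255   # maximum gray level of image intensity


def threshold_voltage(t, v_dd=V_DD, l_max=L_MAX):
    """Map a gray-level threshold T to an analog threshold voltage (Equation 6)."""
    if not 0 <= t <= l_max:
        raise ValueError("gray-level threshold out of range")
    return t / l_max * v_dd


for t in (0, 51, 128, 255):
    print(t, round(threshold_voltage(t), 3))
```

For example, a gray-level threshold of 51 corresponds to 1 V, and the full-scale value 255 corresponds to the 5V supply.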

With this simple mapping, we can now interpret Equation 2 as one voltage transfer characteristic (VTC) curve with a turning point at $V_{in}$. This VTC has low output if the input is low and has high output if its input is high.

Definition 4. A point on a VTC curve is said to be a thresh¬ 
old voltage V m , if it is the point satisfying 

V =— (V -V ) (10) 

in ^ \ r max min /> 

where and V^ n are the maximum and minimum volt¬ 
ages of the VTC. 

Figure 1a illustrates the transition of the VTC with a threshold
$V_{in}$. The VTC is not difficult to realize using two stages of
inverters. We selected CMOS (complementary metal-oxide
semiconductor) technology because it consumes little
power. In addition, it produces very little distortion of the
input-output voltage range, which is desirable because input
images ranging from zero to 255 gray levels should have the
same gray-level range in the output image.

Figure 1b shows the circuit configuration. Note that we
designed the circuit with quite different performance
considerations than in the digital case. Digital circuit design
requires a fixed threshold and a fixed VTC. Neither is
acceptable here, since the motion-detection scheme must
have a modifiable threshold and a modifiable VTC to handle
different motion situations and to eliminate random image
noise. In addition, digital circuit design emphasizes large noise
margins, which are not a concern in designing our
thresholding unit.

One of our design objectives is to implement the threshold
voltage $V_{in}$ for different values of T. From the viewpoint of
hardware implementation, we can relate the threshold voltage
$V_{in}$ to a W/L ratio between the PMOS load and the NMOS
driver. The following theorem gives the design principle.

Theorem 2. For a given arbitrary threshold $\{T \mid L_l \le T \le L_h\}$,
we realize the voltage threshold $V_{in}$ defined in Theorem 1 by
defining the W/L ratio using the following form of the equation,


Figure 1. Curve describing Equation 2, which is to be
implemented by an analog modifiable thresholding unit
(a); and CMOS circuit designed to achieve the VTC (b). P
and N indicate PMOS and NMOS.


$$\frac{W_n/L_n}{W_p/L_p} = \frac{K'_p}{K'_n}\left(\frac{V_{dd} - V_{in} - |V_{Tp}|}{V_{in} - V_{Tn}}\right)^2 \qquad (11)$$

where $W_n$ and $W_p$ indicate the channel widths of the NMOS
and PMOS devices, and $L_n$ and $L_p$ represent the corresponding
channel lengths. $V_{Tn}$ and $V_{Tp}$ are the device threshold
voltages of the NMOS and PMOS transistors. We prove Theorem
2 as follows.


Proof. Considering a VTC with a threshold $V_{in}$, since both the
PMOS and NMOS devices are in saturation mode during the
transition, we have $I_{DN(sat)} = I_{DP(sat)}$. That is,

$$\frac{K_n}{2}\,(V_{GSN} - V_{Tn})^2 = \frac{K_p}{2}\,(V_{SGP} - |V_{Tp}|)^2 \qquad (12)$$

where $V_{Tn} > 0$, $|V_{Tp}| > 0$, and $K_n = K'_n (W_n/L_n)$,
$K_p = K'_p (W_p/L_p)$ [15]. From the circuit configuration in Figure 1b, we have

$$V_{GSN} = V_{in} \quad \text{and} \quad V_{SGP} = V_{dd} - V_{in}. \qquad (13)$$

Substituting these into Equation 12 gives us

$$\frac{K_n}{2}\,(V_{in} - V_{Tn})^2 = \frac{K_p}{2}\,(V_{dd} - V_{in} - |V_{Tp}|)^2. \qquad (14)$$

After some mathematical manipulation we get the final form
given in Equation 11.
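To make the "mathematical manipulation" concrete, the sketch below solves
Equation 14 for the ratio of the NMOS and PMOS W/L values, assuming
K_n = K'_n (W_n/L_n) and K_p = K'_p (W_p/L_p). The device threshold voltages
and the K'_p/K'_n ratio used as defaults are illustrative assumptions, not
values taken from our process, and the function name `wl_ratio` is
hypothetical.

```python
def wl_ratio(v_in, v_dd=5.0, v_tn=0.8, v_tp_abs=0.8, kp_over_kn=0.4):
    """Return (W_n/L_n) / (W_p/L_p) that places the inverter threshold at v_in.

    Obtained by solving Equation 14 for the ratio. All default device values
    are assumptions chosen only for illustration.
    """
    # Both transistors conduct in saturation only inside this input window.
    if not v_tn < v_in < v_dd - v_tp_abs:
        raise ValueError("v_in must lie between V_Tn and V_dd - |V_Tp|")
    return kp_over_kn * ((v_dd - v_in - v_tp_abs) / (v_in - v_tn)) ** 2


# Hypothetical design points: a lower threshold needs a stronger NMOS driver.
ratio_mid = wl_ratio(2.5)
ratio_low = wl_ratio(2.0)
ratio_high = wl_ratio(3.0)
```

With symmetric device thresholds and a mid-supply threshold voltage, the
bracketed term equals one, so the required ratio reduces to the assumed
K'_p/K'_n.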

Obviously, we need cascaded CMOS inverters to implement
the transfer function given by Equation 2.

Lemma. For the cascaded two-stage case, the W/L ratio
defined in Equation 11 still holds.

Proof. We provide an instructive proof of the lemma. Denote
$V_{in1}$, $V_{in2}$, $V_{o1}$, and $V_{o2}$ as the inputs and outputs of the first
stage and the second stage, where $V_{o1} = V_{in2}$. With this relation,
the transistors of both stages can reach saturation
mode during a time period $\Delta t$. Hence the current equation
$I_{DN(sat)} = I_{DP(sat)}$ also holds for the second stage. Therefore,
following the same argument as in the proof of Theorem 2, we derive
Equation 11 to describe the relationship between the threshold
$V_{in}$ and the W/L ratio.

We can choose a W/L ratio to get a different T for
motion-detection applications. We can modify the W/L ratio by
burning fuses. For example, we can fuse a wire with 1-by-6-micrometer
dimensions on a 0.1-micrometer resistive layer [16].

Analog differencing unit. We realize the differencing
operation defined in Equation 2 by using a high-gain,
large-swing CMOS buffer amplifier suggested by Nagaraj [17] to build
a noninverting op-amp configuration. We connect the two input
nodes $V_-$ and $V_+$ to the images E(x, y, t) and E(x, y, t - dt) at pixel
location (x, y) and then connect the output node to the
thresholding unit. This configuration appears in Figure 2a,
which defines one node, or PE, of the network. With
each of these nodes, we can then construct the 2D analog

continued on p. 44 


Figure 2. The noninverting configuration that performs the differencing
operation on two input images is used as the first stage of the PE (a).
A schematic diagram of the 2D analog VLSI network (b). Each PE comprises
a differencing unit followed by a thresholding unit.

Computer Analysis of the
Myoelectric Signal 


Marco Knaflitz 

Gabriella Balestra 

Politecnico di Torino 


Powerful and relatively inexpensive computer systems allow clinicians to use signal process¬ 
ing techniques previously applied only in basic research. Biomedical engineers have adjusted 
research electromyography techniques to match physicians’ needs. In clinical analysis of the 
myoelectric signal detected with needle electrodes and noninvasive probes, computers are 
crucial. 


During the last 50 years, physicians have
been able to analyze myoelectric signals
collected by needle electrodes
inserted into muscles. As a result, they
have gained valuable knowledge about muscle
structure and function, and about pathological
processes of muscles and of the central and
peripheral nervous systems. (For earlier work, see
the accompanying box.) Although myoelectric
signal analysis was worthwhile in clinical practice,
its clinical relevance was definitely more
limited than that of electrocardiography: the signal
produced by the heart is much easier to interpret
because it is quasideterministic and periodic.
With simple amplifiers and monitors,
cardiologists can extract a large amount of
information about heart function by studying
signal morphology in the time domain. Such a
simple approach is not adequate for the
myoelectric signal.

Since the late 1960s, the availability of computer
systems and software for digital signal processing
has encouraged the development of new
techniques for analyzing the myoelectric signal.
Researchers focused on two main subjects:

• the relationships between the myoelectric
signal collected by surface electrodes and
the working of muscles and the nervous
system, and

• new techniques to extract information about
the central nervous system’s strategy for
controlling motor units.

Approximately 15 years of basic research and
the availability of powerful and relatively
inexpensive computer systems have resulted in the
development of several new techniques in
myoelectric signal analysis. These techniques are
finding ready application in standard clinical practice.
We review the most promising applications of
computers to the analysis of both surface and
needle myoelectric signals.

Basics of myoelectric signals 

A skeletal muscle comprises a number of motor
units. Each motor unit consists of muscle fibers
innervated by the terminal branches of a
single α-motoneuron whose cell body is located
in the anterior horn of the spinal cord. The motor
unit is the smallest part of a muscle that the
central nervous system can control individually.
The central nervous system controls muscle force
by adjusting the number of recruited motoneurons
and the firing frequency of each motoneuron.
These two factors are determined by the
synaptic activity on the cell body of each
motoneuron, which, in turn, is a function of the
activity of the descending upper motoneurons
and the afferent activity arriving from peripheral
sensors.

Figure 1 schematically represents myoelectric

Early work with
muscle-generated electricity

In 1666, Italian scientist Francesco Redi inferred that 
muscles generate electricity, and in 1791 Luigi Galvani 
observed the effect of electric current on muscle con¬ 
traction. Using the first galvanometers, Carlo Matteucci 
demonstrated in 1838 that electric currents originate in 
muscles during muscle contraction. 

Studies of the electrical activity of muscles were fre¬ 
quent in the 19th century, and in 1912 H. Piper reported 
the link between myoelectric signal spectral characteris¬ 
tics and muscle fatigue. In 1922, Herbert S. Gasser and 
Joseph Erlanger used the first cathode-ray oscilloscopes 
to observe the morphology of the myoelectric signal. 
Their observations on the electrical activity of muscles 
earned them a Nobel prize in 1944. 

Because of the stochastic nature of the myoelectric
signal, only rough information could be obtained from
its observation until 1960. Initially, the myoelectric
signal was used as a sign of muscle contraction, for
example in studying the activation pattern of lower limb
muscles during locomotion. In such applications,
extracting the important information from the signal is easy,
because it is simply represented by the presence or
absence of the signal itself. Since 1930, devices to detect,
amplify, and display the time course of the myoelectric
signal have been available.


Figure 1. Schematic representation of the myoelectric signal
generation process.


signal generation. We can describe the firing pattern of an
active motoneuron as a train of δ functions and the distance
between consecutive δ pulses as the interspike interval, which
is a random variable. The instantaneous firing rate is the
inverse of the interspike interval. The average firing rate of a
motoneuron and its standard deviation depend on the
synaptic input to that neuron. Firing rates of different
α-motoneurons are, in general, different and uncorrelated.
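The relationship between interspike intervals and the instantaneous firing
rate can be illustrated with a short simulation. This is a minimal sketch:
we assume Gaussian interspike intervals truncated at a small positive value,
which the text does not specify, and the names `spike_times` and
`instantaneous_rates` as well as the numeric parameters are hypothetical.

```python
import random


def spike_times(mean_isi, sd_isi, duration, rng):
    """Generate firing instants (in seconds) of one motoneuron.

    Assumption: interspike intervals are Gaussian, truncated at 1 ms so that
    time always advances; the article only says the interval is random.
    """
    times, t = [], 0.0
    while True:
        t += max(rng.gauss(mean_isi, sd_isi), 1e-3)
        if t >= duration:
            return times
        times.append(t)


def instantaneous_rates(times):
    """Instantaneous firing rate is the inverse of each interspike interval."""
    return [1.0 / (b - a) for a, b in zip(times, times[1:])]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
train = spike_times(mean_isi=0.1, sd_isi=0.02, duration=5.0, rng=rng)
rates = instantaneous_rates(train)
mean_rate = sum(rates) / len(rates)
```

With a mean interval of 0.1 s, the average rate comes out near 10 pulses
per second; the spread of `rates` reflects the standard deviation of the
intervals.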

Each terminal branch of a motoneuron innervates a muscle
fiber at an intermediate point. Through the neuromuscular
junction, the firing of a motoneuron triggers two depolarization
zones that propagate in opposite directions toward the
two ends of the muscle fiber, generating two action potentials
traveling at a speed called the muscle-fiber conduction
velocity.

The sum of the contributions of the fibers belonging to a 
particular motor unit is the motor-unit action potential (MUAP), 
and the sequence of these motor-unit action potentials is 
the MUAP train. 

The myoelectric signal detected at the surface of the skin 
or inside the muscle is the summation of the contributions 
of individual MUAP trains. Because motor-unit discharges 
are irregular and MUAPs have different shapes, we can think 
of the myoelectric signal as a band-limited stochastic pro¬ 
cess with a Gaussian amplitude distribution. Its frequency 
spectrum ranges from DC to approximately 500 Hz. 

Before we can store and observe it, the tissue, skin-elec¬ 
trode interface, electrode configuration, and recording ap¬ 
paratus subject the surface myoelectric signal to a series of 
filtering effects. Thus, the observable myoelectric signal is a 
function of anatomical and physiological factors, which give 
the signal its intrinsic characteristics, and the filtering fea¬ 
tures of the environment and the apparatus, which give it 
extrinsic characteristics. The methodology used to detect and 
record the signal can modify the extrinsic characteristics (see 
box on the next page). The structure and workings of the 
muscle determine the intrinsic characteristics. 

Surface myoelectric signal analysis 

We generally refer to the myoelectric signal collected with
electrodes placed on the skin above a muscle as the surface
myoelectric signal (SMES). The SMES and the myoelectric signal
detected by needle electrodes differ mainly in the following respects:

• The detection volume of surface electrodes is much 
larger than that of needle electrodes—several cubic cen¬ 
timeters compared with fractions of cubic millimeters. 
As a consequence, signals coming from several tens of 
motor units constitute the SMES, while needle electrodes 
detect a signal consisting of the activity of a few muscle 
fibers belonging to 10 to 15 motor units. 

• Because of the larger distance of muscle fibers from
surface probes, the SMES power spectrum is generally


Myoelectric signal detection and amplification 


The term electrode describes a sensor that consists of 
two distinct parts: a detection surface, typically metallic, 
which comes in contact with the skin and senses the myo¬ 
electric signal, and the appropriate housing for the detec¬ 
tion surface. The area of the detection surface affects the 
electrode’s impedance and its detection volume. The de¬ 
tection volume is the space from which the electrode senses 
the signal—the larger the detection surface, the lower the 
impedance and the larger the detection volume. 

Monopolar and bipolar detection techniques. Myoelectric
signals can be detected by different surface electrodes
placed on the skin above the muscle to be
investigated. Figure A shows the two basic techniques,
generally referred to as monopolar and bipolar [1]. In the
monopolar configuration, only one detection surface is
placed on the skin above the muscle to be investigated.
This electrode detects the electric potential with respect to
a reference electrode placed in an environment unaffected
by the electrical activity generated by the muscle to be
studied.

The bipolar detection configuration yields increased spa¬ 
tial resolution and improved noise rejection. This configu¬ 
ration uses two detection surfaces to sense the voltage 
potential at two locations on the skin with respect to a 
reference electrode. The two signals sensed at the detec¬ 
tion surfaces feed into a differential amplifier, whose out¬ 
put generates the single-differential EMG signal. 
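The noise rejection of the bipolar configuration can be illustrated
numerically. In this minimal sketch an ideal differential amplifier
subtracts the two detection-surface signals, so interference that is
identical on both surfaces cancels. The sampling rate, the 50-Hz
interference, the 120-Hz "muscle" component, and its relative amplitudes at
the two surfaces are all hypothetical values chosen only for illustration.

```python
import math


def single_differential(v1, v2):
    """Output of an ideal, unity-gain differential amplifier."""
    return [a - b for a, b in zip(v1, v2)]


fs = 1000.0  # sampling rate in Hz (assumed)
n = 200      # number of samples

# Power-line interference that appears identically on both detection surfaces.
interference = [math.sin(2 * math.pi * 50 * k / fs) for k in range(n)]
# Local muscle activity, seen with a different amplitude at each surface.
muscle = [0.01 * math.sin(2 * math.pi * 120 * k / fs) for k in range(n)]

surface_1 = [i + 1.0 * m for i, m in zip(interference, muscle)]
surface_2 = [i + 0.4 * m for i, m in zip(interference, muscle)]

emg = single_differential(surface_1, surface_2)
```

The common-mode term disappears from `emg`, leaving 0.6 times the muscle
component, even though the interference is 100 times larger than the signal
at each surface. A real amplifier rejects common-mode signals only to the
extent of its CMRR.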

Active electrodes. When relatively
large common-mode signals are generated
on a patient, a major problem
is the generation of a differential-mode
signal resulting from common-mode
excitation in the presence of
an imbalance of the electrode impedances.
This situation is rather common
when detecting bioelectric
potentials from the body surface.

A high-impedance input stage 
minimizes the effects of any imbal¬ 
ance or excessive amount of the elec¬ 
trode impedances. However, this 
solution presents another problem: 
capacitive coupling between the elec¬ 
trode cables and the power line. A 
solution is to design what is known 
as an active electrode, as shown in 
Figure B. 

An active electrode typically con¬ 
sists of a unity gain amplifier with a 



Figure A. Monopolar (1) and bipolar (2) electrode arrangements.

Figure B. Active detection probe principle. R_o represents the output
resistance of the voltage follower. R_i represents the input resistance
of the differential amplifier.

Figure C. Schematic representation of the myoelectric signal amplifier chain.


very high input impedance, mounted as close as possible 
to the detecting surface, thus considerably reducing the 
capacitive couplings between the line and the high- 
impedance section of the circuit. The signal is then fed to 
the EMG amplifier on a low-impedance path, which is much 
less sensitive to capacitive couplings. 

Thanks to their high input impedance, active probes
may be used without skin preparation, shaving, or
conductive paste. Moreover, because of their high immunity
to power-line interference, active electrodes can be
applied in electrically noisy environments (such as hospitals)
without many precautions.

The amplifier chain. Figure C shows the myoelectric 
signal amplifier chain. The active circuitry in the active 
electrode performs an impedance transformation. The re¬ 
sulting high input impedance at the detecting surfaces is 
converted into a low output impedance by voltage follow¬ 
ers or noninverting amplifiers. The low input impedance 
preamplifier is generally a differential amplifier. Its low 
input impedance (properly driven by the active electrode) 
lessens the effects of capacitive couplings between elec¬ 
trode cables and the line. The variable-gain amplifier lets 
the user select different gains to maximize the signal-to- 
noise ratio during each recording. 


The main purpose of the isolation 
amplifier is to isolate the stages directly 
connected to the patient from the rest 
of the circuitry and from the common 
ground. We currently use high-linearity 
optical isolators. Because of their intrin¬ 
sic characteristics, optical isolators are 
an important source of noise in the 
amplification chain. For active antialias¬ 
ing filters, fourth- or sixth-order low- 
pass filters should be used with a 
resulting roll-off of 80 to 120 decibels 
per decade. 
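The quoted roll-off follows from the rule of 20 decibels per decade per
filter order. The minimal sketch below checks this for an ideal Butterworth
low-pass response, which is our assumption; the text does not name a filter
family. The 500-Hz cutoff and the function names are likewise hypothetical.

```python
import math


def butterworth_gain_db(f, f_cutoff, order):
    """Magnitude response in dB of an ideal Butterworth low-pass filter."""
    return -10.0 * math.log10(1.0 + (f / f_cutoff) ** (2 * order))


def rolloff_db_per_decade(order, f_cutoff=500.0):
    """Attenuation gained between 10 and 100 times the cutoff frequency."""
    near = butterworth_gain_db(10 * f_cutoff, f_cutoff, order)
    far = butterworth_gain_db(100 * f_cutoff, f_cutoff, order)
    return near - far


fourth_order = rolloff_db_per_decade(4)
sixth_order = rolloff_db_per_decade(6)
```

Well above the cutoff the slope approaches 20 dB per decade times the
order, which gives the 80 and 120 dB per decade stated above.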

Table A lists typical values of the main 
characteristics of myoelectric signal 
amplifiers. 


Table A. Values for myoelectric signal amplifiers.

Characteristic            Value
Input dynamic range       0-10 mV pp
Overall gain              500-10,000 V/V
CMRR*                     > 100 dB
Input-referred noise      ~10 µV pp
Low-frequency limiting    1-20 Hz
High-frequency limiting   250 Hz-several kHz

*Common mode rejection ratio

limited to 500 Hz, while the power spectrum of the
signal collected by needles extends to 5 or 10 kHz. 

• Since surface probes detect the action potentials generated
by several tens of motor units, interpretation of
SMES in the time domain is much more complicated
than that of a signal collected by indwelling techniques.
This difference explains why physicians showed little
interest in SMES analysis until a few years ago.

The main advantage of SMES analysis compared with classic
indwelling techniques is that surface electrodes are not
invasive. Moreover, because of the relatively large detection
volume of surface electrodes, SMES yields global information
about muscle function.

Computer technology has had a dramatic impact on SMES
analysis, because no clinical application of it would be
possible without computers. In the following, we review several
innovative applications of technically and clinically
significant SMES analysis techniques.

continued on p. 48

Alphorn:

A Remote Procedure Call Environment for Fault-Tolerant, 
Heterogeneous, Distributed Systems 


Alphorn (pronounced Alp-horn) is a software environment for programming
distributed computer systems. Programs running on different computers,
possibly of different types and running different operating systems,
communicate in a client-server relationship by means of remote procedure
calls. This efficient construct structures programs neatly and frees
programmers from having to care where their modules will run.

Distributed computer systems like those
used for industrial process control are
large and complex. Although connecting
computers in a local area network
is relatively easy, programming them requires
skilled personnel, especially when the project
becomes large and must be extended at runtime.
The tools developed by mainframe manufacturers
for single-processor machines fall short in
distributed systems. In particular, they do not meet
the important requirements of industrial control
systems: distribution, real-time response, and fault
tolerance.

Development environments that support the
whole software cycle, integrating programming,
distribution, configuration, communication, and
debugging, are appearing. They try to relieve the
programmer of dealing explicitly with distribution,
real-time behavior, or replication [1]. Alphorn,
a programming environment developed by our
group, goes further in that direction: an application
consists of a collection of modules,
programmed in standard Modula-2, for instance [2].
These modules are written as if they were to be
linked to form a single task running on a single
processor. Indeed, that is the case when no
distribution is required.

All communication between modules takes
place through procedure calls, an approach that
gives programmers a familiar view of the system.
Modules interact on a client-server basis. A module
exports a number of procedures, which are
the services that client modules will call. The
internal data structures of the modules are not
remotely accessible. Programmers must obey the
rules of data encapsulation.

In a distributed system, the server and client
modules run on different computers, different
processors of the same computer, or different tasks
of the same processor. Local procedure calls become
remote procedure calls, but for the programmer, little
is changed. Copies of a module may be loaded on different
nodes to provide several instances of a service (for instance,
to support several printers or to provide fault tolerance). An
automatic generator makes the services of a local module
remotely accessible through a stub mechanism. To this effect,
the services of a module are expressed in a service interface
language, which is an extension of Modula-2.
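The stub mechanism described above can be sketched in a few lines. This is
not Alphorn's generated code: it is a minimal illustration in Python, with
JSON standing in for the data presentation layer and a direct function call
standing in for the network. The names `PrinterService`, `server_dispatch`,
and `ClientStub` are hypothetical.

```python
import json


class PrinterService:
    """Hypothetical server module; only EXPORTED procedures are remotely callable."""

    EXPORTED = ("submit",)

    def __init__(self):
        self.queue = []  # internal data, not remotely accessible

    def submit(self, document):
        self.queue.append(document)
        return len(self.queue)  # position in the print queue


def server_dispatch(service, request_bytes):
    """Server stub: unmarshal the request, call the service, marshal the reply."""
    request = json.loads(request_bytes.decode("utf-8"))
    if request["procedure"] not in service.EXPORTED:
        raise PermissionError("procedure is not exported")
    result = getattr(service, request["procedure"])(*request["args"])
    return json.dumps({"result": result}).encode("utf-8")


class ClientStub:
    """Client stub: looks like a local module but forwards each call as a message."""

    def __init__(self, transport):
        self._transport = transport  # callable: request bytes -> reply bytes

    def __getattr__(self, name):
        def remote_call(*args):
            request = json.dumps({"procedure": name, "args": list(args)})
            reply = self._transport(request.encode("utf-8"))
            return json.loads(reply.decode("utf-8"))["result"]
        return remote_call


server = PrinterService()
printer = ClientStub(lambda request: server_dispatch(server, request))
position = printer.submit("report.txt")  # reads like a local procedure call
```

The client code calls `printer.submit` exactly as it would call a local
procedure; replacing the lambda with a real network transport would not
change the calling code.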

The distribution of modules on the processors and tasks is 
decided only at configuration time, when the work load and 
location of the processors are known. An interactive 
configurator generates the “bag” for the modules and builds 
the tasks. Each task includes runtime support that starts the 
clients, adapts the data presentation, and controls communi¬ 
cation. The data travel over the available media: common 
memory, local area network, or open network. The remote 
procedure call (RPC) protocols are used all along, even for 
communication at the lowest level. 


Alphorn has been ported from PDP 11/RSX to VAX/VMS,
Sun/Unix, VME/VersaDOS, IBM PC, Honeywell/GCOS6, and
to specialized programmable logic controllers. Although these
machines use different processors, languages, and compilers,
they communicate transparently with the same mechanism.
Alphorn has proved useful as a structuring tool for
large applications as well as a communication tool over a
variety of networks.

Distributed process control 

A typical distributed control system consists of computer 
nodes interconnected by a local area network such as Ethernet 
or token bus (IEEE 802.4). The nodes may be of different 
types and run different operating systems. They may them¬ 
selves consist of multiple processors interconnected by a 
parallel bus and sharing a common memory. In process con¬ 
trol, the computers are assigned dedicated tasks, as Figure 1 


Alphorn has proven useful not only as a communication tool 
but also as a structuring tool in large projects. 


RPC protocols have proven their efficiency in process control.
They can be made more efficient if the medium provides
broadcast. However, the Open System Interconnection
(OSI) protocols of the International Organization for
Standardization (ISO) do not support broadcast communications.
To support efficient replication and configuration, Alphorn uses
a causal broadcast protocol, implemented on top of available
protocols like TCP/IP. In fault-tolerant applications, this protocol
supports the updating of replicated services transparently.

Several other projects at least partly explore the same approach.
The ISIS system, which laid the foundations of RPC communication,
provides a form of replication [3]. Apollo’s Network
Computing System and Sun’s Network File System are commercial
products that use RPC communication, but they do
not consider fault tolerance [4,5]. The Advanced Networked Systems
Architecture (ANSA) project and the ESPRIT Delta-4
project define a general framework for distributed systems
but place little emphasis on fault tolerance [6,7].

These projects influence the current standardization work
of ECMA (European Computer Manufacturers Association) [8]
and especially the work on Open Distributed Processing
(ODP) supported by ISO/IEC SC21 and the CCITT. These
standards define an open architecture with many options and
variants. By contrast, Alphorn ensures communication within
a dedicated system, allowing us to keep it simple.


(on the next page) suggests. 

The communication takes different paths, depending on 
the location of the partners: 

• Nodes communicate through the network by sending 
and receiving messages. They do not share a common 
address space. A name server may provide a systemwide 
directory service. 

• Processors within a multiprocessor communicate through 
a common memory, which may be located globally on 
the parallel bus or may consist of dual-ported local memo¬ 
ries. The processors may take advantage of the mes¬ 
sage-passing facility offered by some parallel buses, such 
as Multibus II. They may form a pool or be of different 
types. 

• Tasks executed by the same processor share the same 
physical memory but are prevented from accessing each 
other’s region by the memory management unit. They 
communicate through shared-memory regions using ser¬ 
vices of the kernel. (Tasks are called processes in VAX/ 
VMS or Unix terms.) 

• Threads are parallel activities within the same task.
Threads (called processes in Modula-2 and light-weight
processes in Unix) are based on coroutines. They share
the address space of their task and communicate with
each other through common variables. Threads of the
same task cannot interrupt (preempt) each other; instead, a
thread runs until it reaches a synchronization point. The
task holds a small scheduler that determines the next
thread to run. Because of its synchronous nature, a thread
switch is faster than a task switch. (Threads are the smallest
unit of parallelism we consider here.)

Figure 1. A distributed process control system.
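The nonpreemptive thread model in the last item of the list above can be
illustrated with coroutines. This is a minimal sketch, not Alphorn's
runtime: Python generators stand in for Modula-2 coroutines, `yield` marks a
synchronization point, and the round-robin policy of `run_task` is our
assumption, since the text only says the task holds a small scheduler.

```python
from collections import deque


def worker(name, steps, log):
    """A thread body: runs undisturbed until it reaches a synchronization point."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # synchronization point: hand control back to the scheduler


def run_task(threads):
    """Small scheduler held by the task; threads are never preempted."""
    ready = deque(threads)
    while ready:
        thread = ready.popleft()
        try:
            next(thread)  # run the thread up to its next synchronization point
        except StopIteration:
            continue  # the thread has finished; do not reschedule it
        ready.append(thread)


log = []
run_task([worker("A", 2, log), worker("B", 3, log)])
```

Because a switch happens only at a `yield`, the threads can update the
shared `log` without locks, which mirrors why a thread switch is cheaper
than a task switch.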

The communication between these entities is illustrated in 
Figure 2. Since communication paths vary widely, we wish 
to offer the programmer a unified communication interface 
independent of the communication path. 

Traditionally, communication between remote entities has
been based on messages. Logical channels, which implement
point-to-point or multipoint links over the same physical
medium, carry the messages. Channels are called mailboxes,
pipes, ports, and so on, with subtle variations in meaning.
The programmer first opens a channel and then sends data
explicitly by using something like a Send Message operation.
The partner receives the message by a Receive Message
operation. These operations can be inserted at any place in a
program, giving the programmer great freedom but also great
responsibility. Message passing lends itself to spaghetti-code
programming, which is hard to debug and
maintain, especially when several entities write to and read from
the same channel.

Figure 2. Parallel entities.



Figure 3. Calling a service from a server module. 


While the communication system can¬ 
not avoid some kind of message pass¬ 
ing, the application programs should be 
better structured. Alphorn relies on the 
Remote Procedure Call (RPC) as an al¬ 
ternative to message passing. The dif¬ 
ference between messages and RPCs is 
comparable to that between Fortran and 
structured languages without Goto state¬ 
ments. Programming with RPC consid¬ 
erably influences programming style. 
When RPCs are properly used, there is 
no need for other communication con¬ 
structs. 

Programming style 

Programming in Alphorn is based on
the concept of objects, which is supported
by high-level languages. An object
consists of data structures and a
well-defined set of procedures giving
access to the data. An example of such
an object is the Oracle database; the
contents of the database can be accessed
only through procedures such as
GetItem, InsertItem, RemoveItem, and
SearchItem. These procedures form the
object interface, which is all a user must
know to access the database. The database
can be implemented by arrays, lists,
records, and so on, which the user need
not know about.
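The object concept can be sketched as follows. The class below is a
hypothetical stand-in, not the Oracle interface: it mirrors the four
procedure names from the text in Python naming style and hides its storage,
here a dictionary chosen arbitrarily, behind them.

```python
class ItemDatabase:
    """Hypothetical object whose data is reachable only through its procedures."""

    def __init__(self):
        self._items = {}  # private representation; could equally be a list or array

    def insert_item(self, key, value):
        self._items[key] = value

    def get_item(self, key):
        return self._items[key]

    def remove_item(self, key):
        del self._items[key]

    def search_item(self, predicate):
        """Return the keys whose values satisfy the predicate, in sorted order."""
        return sorted(k for k, v in self._items.items() if predicate(v))


db = ItemDatabase()
db.insert_item("bolt", 120)
db.insert_item("nut", 15)
db.insert_item("washer", 300)
low_stock = db.search_item(lambda quantity: quantity < 200)
db.remove_item("nut")
```

A client written against these four procedures keeps working if the
dictionary is later replaced by another structure, which is the property
that lets a local object become a remote server without changing its users.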

Modula-2 expresses the object interface
in the definition module, which lists
all exported variables, data types, and
procedures. The implementation of the
object is contained in the implementation
module, which is a private affair of
the object. The database user need not know
how and where the items are stored.

A module that exports procedures becomes
a server when called by a client
program. An example of a service call
appears in Figure 3. The server “Bank”
is defined by a server module. Its server
interface is defined in the definition file
“Bank.def.” The procedures exported by
the definition module (for example,
GetItem) are the services of that server
module. The services are programmed
in the implementation file “Bank.mod.”

The calling conventions of the client 
continued on p. 60

A RISC Processor for Embedded
Applications Within an ASIC 


RISC processors are generally not suitable for use as ASIC cores due to their die size, power 
dissipation, and pin counts. The VL86C010 processor, however, is well suited to RISC applica¬ 
tions because of its small die, low power dissipation, and low pin count. 


Charles E. Roberts 

VLSI Technology, Inc. 


Speed is the ultimate challenge in computing;
it always has been and it probably
always will be. Nowhere is this
more true than in embedded controller
applications, in which application-specific ICs
(ASICs) are often considered for the design solution.
However, while many speed-sensitive embedded
controller applications require high
performance, they also incur the penalties
such performance levels create, penalties that
significantly affect ASIC designs.

Designing a control system to embed in some
larger product or system usually frustrates designers
because of the engineering demons of size,
weight, power, cost, reliability, and security.
Designers need a fast, inexpensive, low-power controller
that can handle the most functions with
the least hardware. Low power is important to
reduce the size, weight, and cost of the power
supply and to reduce the requirement for special
heat control devices (heat sinks, fans). Unfortunately,
speed, power, and cost are usually opposing
parameters, especially in products for the
reduced instruction-set computing (RISC) processor
market.

To achieve RISC performance, designers ex¬ 
pect to pay heavily in terms of power and cost. 
System integration using one or more ASICs is 
the logical route to achieving the goals of size, 
weight, power, and cost reduction, and of in¬ 
creasing system reliability and security. However, 
most RISC processors currently available have die 
sizes, power requirements, and pin counts that 
make them economically infeasible for use in ASIC 
designs. In many cases these RISC processor die 
sizes reach the upper limit of present manufac¬ 
turing capability for practical ICs, and their prices 
reflect this fact. 

Our high-performance RISC processor family
offers the power, cost, die sizes, and pin counts
that are well suited to ASIC designs. Designed
using our IC design tools, these parts naturally fit
as cores in other ASIC designs that use the same design tools.
The architecture, instruction set, and features show that this
family is powerful in its own right, and especially so for
embedded control applications.

Figure 1. VL86C010 RISC system block diagram.

VLSI's VL86C010 32-bit, fixed-point RISC processor comes in
an 84-pin PLCC package and dissipates 100 mW.

A significant difference between this RISC processor fam¬ 
ily and others is that it was designed for maximum system 
performance at minimum cost rather than maximum CPU 
performance at any cost. Designers took a fresh approach to 
the system design and partitioning to provide high perfor¬ 
mance and very high speed interrupt response while keep¬ 
ing the system busing to a minimum (see Figure 1). With 
these goals in mind, they partitioned the components of the 
system (CPU, memory controller, I/O controller, and video 
controller) to take advantage of less-expensive plastic IC pack¬ 
ages and, more importantly, less-expensive dynamic RAM 
devices (DRAMs) by using their page mode of operation. 1 

This partitioning allowed the CPU to be placed into an 84- 
pin plastic leaded chip carrier (PLCC) and the other members 
of the family to be placed into 68-pin PLCCs. It also reduces 
the number of bus lines on a custom ASIC device, one key to 
Figure 2. VL86C010 RISC processor block diagram. 


the devices’ suitability as cores. The reduced number of bus 
lines and wires to route has a corresponding impact on die 
area and, so, on cost. On some large, complex chips, it is not 
uncommon for designers to use close to 50 percent of the 
area for routing. 

The CPU 

The RISC CPU (VL86C010) is a 32-bit, fixed-point proces¬ 
sor with a barrel shifter and a Booth’s multiplier, as seen in 
Figure 2. It contains twenty-seven 32-bit registers (organized 
as four partially overlapped sets of 16 registers), a three-level 
instruction pipeline, a 64-Mbyte linear address space, and 
full virtual memory support. Even though this processor pro¬ 
duces impressive performance statistics, it typically draws just 
0.1 watts of power at 10 MHz. One particularly interesting 


feature of the instruction set is that every instruction can be 
executed conditionally in the same manner as branch 
instructions. 

The processor has 16 possible conditions (such as zero, 
positive, overflow) coded in the opcode in a 4-bit field. Be¬ 
cause the hardware was already there to decode this field for 
the conditional execution of branch instructions (like virtu¬ 
ally all processors), designers found it an easy matter to make 
all instructions conditionally executable. As a result program¬ 
mers can execute two different segments of code depending 
upon a condition without making a conditional branch over 
one segment of code. Programmers simply include the con¬ 
dition directly in the instructions of that segment. For short 
forward references (which are very common), this approach 
yields more compact and faster code. 
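
The idea can be sketched with a toy interpreter. This is a minimal Python model under stated assumptions, not VL86C010 code: the flag name, the three condition names (AL, EQ, NE), and the tuple encoding are hypothetical, and only a few of the 16 conditions are modeled.

```python
# Toy model of conditional execution: every instruction carries a condition
# that is checked against the flags before the instruction takes effect.
# Condition names and the instruction encoding are illustrative assumptions.

CONDITIONS = {
    "AL": lambda f: True,           # always execute
    "EQ": lambda f: f["zero"],      # execute if the zero flag is set
    "NE": lambda f: not f["zero"],  # execute if the zero flag is clear
}

def run(program, regs):
    """Execute (cond, op, dest, a, b) tuples; return the instruction count."""
    flags = {"zero": False}
    executed = 0
    for cond, op, dest, a, b in program:
        executed += 1                  # the instruction is always fetched
        if not CONDITIONS[cond](flags):
            continue                   # condition failed: acts as a no-op
        if op == "cmp":
            flags["zero"] = regs[a] == regs[b]
        elif op == "mov":
            regs[dest] = regs[a]
        elif op == "add":
            regs[dest] = regs[a] + regs[b]
    return executed

# if r0 == r1: r2 = r3, else: r2 = r3 + r4, written without any branch
program = [
    ("AL", "cmp", None, "r0", "r1"),
    ("EQ", "mov", "r2", "r3", None),
    ("NE", "add", "r2", "r3", "r4"),
]

regs = {"r0": 5, "r1": 5, "r2": 0, "r3": 7, "r4": 1}
count = run(program, regs)
```

Both arms of the if/else sit in straight-line code, so the instruction stream is never redirected; the arm whose condition fails simply has no effect.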
Figure 3. VL86C110 MEMC block diagram. 


This and other features (discussed later) allow this proces¬ 
sor to outperform other processors clocked at higher rates. 
The higher clock rates produce higher power and faster 
memory requirements that are undesirable for embedded 
control applications. We offer our RISC processor at com¬ 
petitive clock speeds but find it frequently doesn’t need to be 
run that fast because of its high throughput. 

No matter how fast a processor can execute internal in¬ 
structions, it loses value quickly if it doesn’t have an equally 
fast method of communicating with the outside world. For 
embedded applications, this concern is often most important 
in handling interrupts. The VL86C010 contains a classical in¬ 
terrupt request input pin that behaves as on most other pro¬ 
cessors. In addition, the processor features a fast interrupt 
request input pin. 

When this pin is pulled active (low), the processor finishes 
the current instruction, switches the mode to fast interrupt, 
and switches register sets (picking up seven new registers dedi¬ 
cated to this mode). It immediately executes the fast interrupt 
subroutine at a dedicated memory location so that a branch 
instruction doesn’t need to be executed. In this mode the processor 
can respond to a fast interrupt request (FIRQ) in 2.5 clock cycles. 
The worst-case maximum response of 22.5 cycles occurs only 
when block register move instructions are executing and the 


size of the block is set to the maximum of 16 registers. If faster 
worst-case interrupt response is required, programmers sim¬ 
ply don’t make 16 register block moves in the code. As it turns 
out, the worst-case response is not a significant penalty as one 
rarely needs to move all the registers at once. 
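
To put these cycle counts in terms of time, the short calculation below converts them to nanoseconds. It is only a sketch: the 10-MHz clock is an assumption borrowed from the power figure quoted earlier, and the function name is illustrative.

```python
# Convert the fast-interrupt response figures quoted above from clock
# cycles to time. The 10-MHz clock is an assumed operating point (the rate
# cited for the 0.1-W power figure); other clock rates scale the results.

def latency_ns(cycles, clock_mhz):
    """Return the time taken by `cycles` clock cycles, in nanoseconds."""
    return cycles * 1000.0 / clock_mhz

best_ns = latency_ns(2.5, 10.0)    # typical FIRQ response
worst_ns = latency_ns(22.5, 10.0)  # during a 16-register block move

print(f"best case:  {best_ns:.0f} ns")
print(f"worst case: {worst_ns:.0f} ns")
```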

Memory controller 

The VL86C110 RISC memory controller (MEMC) handles 
all the timing functions for the DRAMs and uses the fast page 
mode to maximize bandwidth (see Figure 3). 

The MEMC contains DMA address generators for video, 
cursor, and sound data buffers and provides full support for 
logical-to-physical address translation for virtual memory. Of 
particular interest is the fact that the 32-bit data bus does not 
connect to the MEMC. This feature affects the core size of the 
MEMC as well as the die size of an ASIC that would use the 
MEMC because of the reduced busing discussed earlier. While 
speed is all important in RISC applications, die size is all 
important in ASICs because of the cost. 

Video controller 

The VLSI VL86C310 video controller (VIDC) contains all 
the circuitry to implement color video and stereo sound in- 

continued on p. 68 
Performance and the 
i860 Microprocessor 


The internal design of the i860 CPU exploits pipelining and parallelism more than previous 
microprocessors. It uses RISC concepts and memory-performance optimizations in several 
novel ways. Understanding these features will help programmers optimize code and system 
designers maximize throughput. 


Mark Atkins 

Intel Corp. 


In 1989 Intel introduced the i860 XR (Figure 1) 
processor as the fastest microprocessor 
possible in a million-transistor 
chip. 12 It performs at 40 MHz without 
external cache memory at 26.7 Specmarks. Con¬ 
tributing to this speed are instructions tuned for 
parallelism, internal caches with a 960-Mbyte/s 
internal bandwidth, and an 8-byte external bus 
operating at 160 Mbytes/s. In 1991, Intel intro¬ 
duced the i860 XP microprocessor, a larger, faster 
version with added features. Most of the topics I 
discuss here apply to both chips. 

Architecture versus 
implementation 

People generally define a computer’s archi¬ 
tecture as the programmer-visible instruction set, 
registers, and memory arrangement. Most archi¬ 
tectures have a variety of different implementa¬ 
tions (actual hardware boxes or chips conforming 
to the architecture) at various cost levels. 

However, architecture and implementation are 
closely intertwined. We can’t run benchmarks 
on architectures, but an obsolete architecture can 
slow implementations. For example, the com¬ 
plexity of the IBM 370 renders a given amount 
of hardware slower than the same hardware de¬ 
voted to a RISC (reduced instruction-set com¬ 
puter) machine. 3 Computer buying and design 
decisions usually consider implementation more 
than architecture. In addition, possible hardware 


implementations influence initial specification of 
the architecture. The feasibility of a million- 
transistor chip allowed the i860 designers to in¬ 
clude features like virtual memory, floating point, 
and 3D-graphics support. 

The architectures of most RISCs are very simi¬ 
lar. All attempt to give the best possible perfor¬ 
mance per dollar, using: 

• 32-bit instructions, data, and addresses; 

• large register files, to avoid (slow) memory 
accesses; 

• pipelining; 

• caches for fast instruction and data access; 

• register-to-register operations, to avoid 
memory accesses; 

• “reduced” number of instructions, to avoid 
long design periods, debugging efforts, and 
silicon space for microcode; and 

• delayed branches, which execute the instruc¬ 
tion after the branch to accomplish useful 
work while fetching the branch target. 

While the i860 CPU implements all the RISC 
rules, it adds several novel items to enhance per¬ 
formance. Even when software does not exploit 
the extra parallelism features, the i860 still achieves 
efficient RISC performance. As compilers grow 
more sophisticated, they may produce faster code 
through simultaneous parallel execution of inte¬ 
ger and floating-point (or graphics) instructions. 
The i860 CPU implementation 

The most notable difference between 
the i860 XR processor and other CPUs is 
its single-chip implementation. This one 
piece of silicon contains: 

• an integer unit, including thirty-one 
4-byte registers; 

• 32 floating-point registers, an adder 
unit, and a multiplier unit; 

• a graphics unit; 

• a 4-Kbyte instruction cache; 

• an 8-Kbyte data cache; and 

• a memory management unit with 
64-entry translation look-aside 
buffer. 

This approximately $300 chip makes 
systems less expensive in both chip cost 
and system engineering effort than a 
separate integer chip, floating-point chip, 
cache chips, and cache controller chip 
found in competing RISC machines. Most 
chip suppliers now are working on 
single-chip solutions to be introduced in 
1991. 

Beyond cost considerations, a single 
chip offers advantages from wide, fast 
buses, which are feasible only when they 
do not cross chip boundaries. The on- 
chip cache bandwidth, which is 960 
Mbytes/s at 40 MHz (Figure 2), includes: 

• a 320-Mbyte/s instruction band¬ 
width (two executed each clock 
cycle) and 

• a 640-Mbyte/s data bandwidth (16 
bytes fetched or stored each cycle). 
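
The two components follow directly from the clock rate. The arithmetic below is a minimal check, assuming 4-byte (32-bit) instructions; the constant names are illustrative.

```python
# Reproduce the on-chip cache bandwidth figures quoted above.
CLOCK_MHZ = 40
INSTRUCTION_BYTES = 4        # 32-bit instructions
INSTRUCTIONS_PER_CYCLE = 2   # two instructions executed each clock cycle
DATA_BYTES_PER_CYCLE = 16    # 16 bytes fetched or stored each cycle

instr_bw = INSTRUCTION_BYTES * INSTRUCTIONS_PER_CYCLE * CLOCK_MHZ  # Mbytes/s
data_bw = DATA_BYTES_PER_CYCLE * CLOCK_MHZ                         # Mbytes/s
total_bw = instr_bw + data_bw

# Ratio of on-chip bandwidth to the 160-Mbyte/s external bus
external_bw = 160
ratio = total_bw / external_bw

print(instr_bw, data_bw, total_bw, ratio)
```

On these figures the on-chip paths carry six times what the external bus can deliver, which is why the cache hit rate matters so much.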

While the 12 Kbytes of on-chip cache is 
less than would be possible with sepa¬ 
rate external cache chips, it contributes 
to performance significantly with a hit 
rate typically exceeding 90 percent. 

Multiport register files. The regis¬ 
ter file also is crucial to on-chip band¬ 
width. For example, the 32-bit ports on 
some processors reduce double-preci¬ 
sion, 64-bit floating-point performance 
by requiring two accesses to secure one 
operand. In the i860 CPU, the floating¬ 
point registers’ five 64-bit ports can fetch 
three double-precision numbers in each 
Figure 1. The i860 XR processor block diagram. 
Figure 2. Internal and external bandwidths at 40 MHz. 


clock cycle—two numbers for an add (or multiply) operation 
simultaneously with one for a store-to-memory step. Simulta¬ 
neously, the register file can write the result of a previous 
add (or multiply) and the result of a memory load. The inte¬ 
ger registers contain two read ports and one write port, al¬ 
lowing the simultaneous three-operand accesses implied in 
instructions such as ADDU R1,R2,R3. The register bandwidth 
totals 2 Gbytes/s. 
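
One way to arrive at roughly this total is to add up the port widths. The sketch below assumes a 40-MHz clock and that every port transfers data on every cycle; the article does not spell out its own derivation, so treat this as a plausibility check.

```python
# Peak register-file bandwidth from the port counts given above.
CLOCK_MHZ = 40

fp_ports, fp_port_bytes = 5, 8     # five 64-bit floating-point ports
int_ports, int_port_bytes = 3, 4   # two read ports + one write port, 32 bits

fp_bytes_per_cycle = fp_ports * fp_port_bytes       # 40 bytes
int_bytes_per_cycle = int_ports * int_port_bytes    # 12 bytes

register_bw_mbytes = (fp_bytes_per_cycle + int_bytes_per_cycle) * CLOCK_MHZ
print(register_bw_mbytes)  # Mbytes/s, about 2 Gbytes/s
```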

Pipelining. The instruction processing pipeline of the i860 
processor, shown in Figure 3, includes four stages assigned 
the acronym FDEW: 

• fetch the instruction from the cache or memory; 

• decode it and access the register file(s); 

• execute the arithmetic, shift, or logical operation; and 

• write the result into the register file. 

Those four segments of processing usually require similar 
amounts of time due to hardware considerations, and thus 
most RISC processors use a four-stage pipe. Figure 3 is a 
pipeline resource diagram, plotting time against hardware 
and instructions. Such diagrams assist designers in avoiding 
resource conflicts. By drawing sequences of instructions and 
their hardware usage, designers can detect conflicts such as 
overlapping register-port usages. As long as the same hard¬ 
ware (such as the ALU, abbreviated E in the diagrams) is not 
used by two instructions simultaneously, the pipeline works. 
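
A resource diagram of this kind is easy to generate mechanically. The following Python sketch assumes an ideal pipeline in which one instruction enters the F stage on every clock cycle; the function names are illustrative, and the conflict check simply confirms that no stage is claimed twice in the same cycle.

```python
STAGES = "FDEW"  # fetch, decode, execute, write

def schedule(num_instructions):
    """Return {cycle: {stage: instruction index}} for an ideal pipeline in
    which one instruction enters the F stage on every clock cycle."""
    table = {}
    for i in range(num_instructions):
        for offset, stage in enumerate(STAGES):
            cycle = i + offset
            slot = table.setdefault(cycle, {})
            if stage in slot:
                raise ValueError(f"stage {stage} used twice in cycle {cycle}")
            slot[stage] = i
    return table

def render(num_instructions):
    """Draw one row per instruction and one column per clock cycle."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = ["-"] * total_cycles
        for offset, stage in enumerate(STAGES):
            row[i + offset] = stage
        rows.append(f"I{i}: " + " ".join(row))
    return "\n".join(rows)

table = schedule(3)
print(render(3))
```

Each column of the printed diagram contains any given stage letter at most once, which is the no-conflict condition described above.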

The pipeline for both conditional and unconditional 
branches works in four stages abbreviated FDIW, similar to 
FDEW: 

• fetch the (branch) instruction; 

• decode and generate the branch target address; 

• access instruction cache for target and decode the in¬ 
struction in the branch delay slot (if conditional branch, 
check condition code bit to determine if taken); and 

• write the return address to the register file (if this is a 
subroutine call) and execute the instruction in the branch 
delay slot. 

If a conditional branch (BC.T, BNC.T) is not taken, the pipe¬ 
line takes an additional clock cycle to restart the fetch of the 
sequential stream. Of course, if the sequential instruction is a 
cache miss, or the target is a miss for a taken branch, the CPU 
initiates an external bus fetch and suspends execution. 

A dedicated branch adder unit calculates the target ad¬ 
dress using the current program counter and an offset amount 
embedded in the current instruction. The 26-bit offset denotes 
4-byte instructions, giving a 256-Mbyte direct target 
range. To allow a full 4-Gbyte branch range, the CPU also 
implements indirect branches whose target comes from a 32- 
bit register. 
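
The direct range follows from the field width: a 26-bit offset counted in 4-byte instructions spans 2^26 x 4 bytes. The sketch below checks that arithmetic and shows a target calculation. It assumes the offset is a signed (two's complement) field added to the program counter value supplied by the caller; the text above does not state either detail.

```python
OFFSET_BITS = 26
INSTRUCTION_BYTES = 4

# Total span addressable by the direct offset
direct_range_bytes = (1 << OFFSET_BITS) * INSTRUCTION_BYTES
direct_range_mbytes = direct_range_bytes // (1 << 20)

def branch_target(pc, offset_field):
    """Sign-extend a 26-bit word offset (an assumption) and add it to pc,
    wrapping the result to a 32-bit address."""
    if offset_field & (1 << (OFFSET_BITS - 1)):
        offset_field -= 1 << OFFSET_BITS
    return (pc + offset_field * INSTRUCTION_BYTES) & 0xFFFFFFFF

print(direct_range_mbytes)
print(hex(branch_target(0x1000, 1)))
```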

Code fragment: 

addu src1,src2,dest 
br target 

subu src1,src2,dest 

(more instructions here from sequential stream) 
target: 

subs src1,src2,dest 

Figure 3. Code fragment (a) and instruction pipeline (b). 

The on-chip instruction cache shortens the branch pipeline 
to four stages, when it might otherwise be five, with two 
clock cycles required in the I stage of FDIW. Were the cache 
external, the processor would have to drive the address on 
the external bus, propagating through the external cache and 
transferring the instruction via the external bus. 

Floating-point pipelines. While RISC implementations 
universally follow the FDEW scheme for integer operations, 
they vary much more in the floating-point arena. Here the 
i860 CPU excels because it implements a low-latency and 
high-throughput three-stage execution pipe. Low latency is 
essential to performance when executing serial algorithms, 
in which each calculation uses results from the previous cal¬ 
culation. The floating-point adder and multiplier each have 
three stages, which amount to 75-ns latency at 40 MHz. Actu¬ 
ally, the multiplier uses two stages at two clocks each for 
double-precision operands. 4 The graphics instructions also 
use the floating-point registers for additions and compari¬ 
sons on integer values, but are faster at one clock cycle each. 

Including fetch (F), decode (D), and write-back (W) with 
the three execute (E) stages yields six stages overall. Two 
versions of floating-point instructions exist, as illustrated in 
Figure 4: 

• scalar, in which the E stages of successive floating-point 
operations do not overlap, and 

• pipelined, in which the E stages do overlap. 

Note from the figure that the integer and floating-point 
operations can overlap; only consecutive scalar floating¬ 
point instructions cannot overlap the E stages. The separate 
floating-point and integer register files allow writes to be 
processed out-of-order as in the case of the FMUL (floating-point 
multiply) and integer SUBU (subtract) in the figure. 

addu src1,src2,dest1 
fadd fsrc1,fsrc2,fdest2 
fsub fsrc1,fsrc2,fdest3 
subu src1,src2,dest5 

The pipelined floating-point instructions write the result from 
a previous instruction simultaneously with the second E stage, 
creating an FDEW e E sequence. 

Explicit pipelines. The pipelined mode is similar to mi¬ 
crocode on array processors in which the instruction speci¬ 
fies two source operands but does not specify the 
corresponding destination register. Instead, the destination 
field of the pipelined floating-point operation receives the 
result from the third previous floating-point instruction. That 
is, floating-point instruction I updates its RDEST with data 
generated by floating-point instruction I-3. This explicit 
pipelining makes programming a bit tricky because the pro¬ 
grammer or compiler must keep track of the contents of the 
floating-point pipelines. However, it also yields a 3-to-1 
speedup for floating-point-intensive software. At 40 MHz, 
scalar code achieves 13 Mflops (million floating-point op¬ 
erations per second), but pipelined code yields 40 Mflops. 
Software libraries for i860 graphics and vector math exploit 
pipelined floating point, and compilers have begun to gen¬ 
erate it. 
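
The bookkeeping involved can be illustrated with a small simulation. This is a toy Python model, not i860 code: the class, the register names (including the hypothetical "junk" and "zero" registers), and the use of None for an empty stage are all assumptions. It shows that each issue writes the result started three issues earlier, and that three extra issues are needed to drain the pipe.

```python
from collections import deque
import operator

PIPE_DEPTH = 3   # three execute stages, as described above
CLOCK_MHZ = 40

class ExplicitPipe:
    """Toy model of one explicitly advanced floating-point pipeline."""

    def __init__(self, regs):
        self.regs = regs
        self.stages = deque([None] * PIPE_DEPTH)  # index 0 holds the newest

    def issue(self, op, src1, src2, dest):
        """Start op(src1, src2) and store the result that leaves the pipe
        (started three issues earlier) into dest."""
        started = op(self.regs[src1], self.regs[src2])
        leaving = self.stages.pop()
        if leaving is not None:
            self.regs[dest] = leaving
        self.stages.appendleft(started)

regs = {"a0": 1.0, "a1": 2.0, "a2": 3.0, "a3": 4.0,
        "b0": 10.0, "b1": 20.0, "b2": 30.0, "b3": 40.0,
        "zero": 0.0, "junk": 0.0}
pipe = ExplicitPipe(regs)

# Four real additions; the first three have nothing useful to write back.
for i, dest in enumerate(["junk", "junk", "junk", "s0"]):
    pipe.issue(operator.add, f"a{i}", f"b{i}", dest)

# Three dummy operations push the remaining results out of the pipe.
for dest in ["s1", "s2", "s3"]:
    pipe.issue(operator.add, "zero", "zero", dest)

# One result per three clocks (scalar) versus one per clock (pipelined)
scalar_mflops = CLOCK_MHZ / PIPE_DEPTH
pipelined_mflops = CLOCK_MHZ / 1
```

The two throughput figures at the end correspond to the 13-Mflops and 40-Mflops rates quoted above, under the assumption that scalar code completes one operation every three clocks.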

Additionally, the explicit pipelining keeps more floating¬ 
point registers available to the program. If FADD F5,F6,F7 


updates F7 with the sum F5 + F6, then 
F7 cannot be used in the next three in¬ 
structions without causing a hardware 
interlock (delay) for those instructions 
to wait for the correct F7 value to emerge 
from the execution pipeline. 

Finally, explicit pipelining simplified 
the hardware implementation, remov¬ 
ing the need for controls to check for a 
just-mentioned case (instructions at¬ 
tempting to use a floating-point register 
whose contents are obsolete). The avoid¬ 
ance of that checking logic is also the 
reason that scalar floating-point instruc¬ 
tions wait for their predecessors to fin¬ 
ish the E stages. The name “scalar” 
denotes the type of code in which the 
destination register of an instruction be¬ 
comes a source in the next instruction, 
in which case the interlock is necessary 
anyway. 

Even without architectural changes, 
scalar throughput can improve to one 
clock cycle in future implementations 
that provide interlock checking. Then 
both scalar and pipelined code can 
achieve 1 Mflops per megahertz. 

Some computer architects criticize the 
exposed floating-point pipelines and the 
embedding of their length into software, because future tech¬ 
nologies might allow shorter pipes. However, any software 
contains implicit assumptions of pipeline length, as the in¬ 
struction scheduling for avoidance of interlocks requires la¬ 
tency knowledge. Like the Mips R2000/R3000 architecture 
with the memory-load pipeline length exposed to software, 
the floating-point exposure allowed simpler hardware and 
software optimization. And like Mips-2 with the load-delay 
slot removed, future considerations may require redesign¬ 
ing the floating-point pipelines. However, a compatible three- 
stage floating-point mode will always be supported in the 
hardware to run old “dusty deck” code, that is, old pro¬ 
grams that current users cannot, or will not, rewrite. 

Duality and six-stage pipelines. To avoid unnecessary 
instruction sequencing delays, the chip incorporates two fea¬ 
tures. I use “unnecessary” to refer to the time the floating¬ 
point unit idles while integer instructions execute (or vice 
versa), and the time the multiplier idles during adder operation. 

One feature, dual-operation floating-point instructions, al¬ 
lows both the floating-point adder and multiplier to work in 
multiply-accumulate fashion. This kind of parallelism is common 
in digital signal processor chips and in complex-number arithmetic. 
In dual operations, the three-stage adder and multiplier pipe- 

continued on p. 72 


Figure 4. Floating-point instruction pipeline: Code fragment containing both in¬ 
teger and floating-point operations (a), scalar instructions (b), and pipelined in¬ 
structions (c). A dash indicates an idle clock cycle. 
IEEE Standard P1754: An Open Microprocessor Architecture 


[The second part of our series on tools addresses 
a standard ready for final balloting. Rudolf 
Usselmann discusses the implications of this stan¬ 
dard and stresses its advantages. Please note the 
balloting date in the last paragraph. 

I invite readers to send me input regarding a 
tool or method that helps solve problems for future 
columns in the series. — C. W.] 


Rudolf Usselmann 
Sparc International 


Many of today’s existing microprocessors 
are vendor-proprietary architectures 
and not subject to industry 
consensus. In September 1990, Sparc Interna¬ 
tional, Inc. initiated a P1754 (IEEE standards) 
working group. Its goal was to define an open 
microprocessor architecture that is not bound to 
a specific technology and is available to a wide 
variety of manufacturers and users. 

P1754 is the first open microprocessor architecture 
to be accepted as a standard by the IEEE. 
Some people expressed an understandable skep¬ 
ticism about the outcome of the project due to 
the number of players involved. Many did not 
think representatives from so many major com¬ 
panies would be able to agree on one architec¬ 
ture that would be feasible to implement. After 
many meetings, five P1754 drafts, megabytes of 
e-mail, and endless phone conversations, P1754 
is now in its final stage in the standard develop¬ 
ment process. 

The outcome of the P1754 project is a 


nonproprietary microprocessor architecture 
based on the existing Scalable Processor Archi¬ 
tecture (Sparc) developed by Sun Microsystems. 
The standard includes a definition of the instruc¬ 
tion set architecture (ISA), register model, data 
types, instruction opcodes (including floating¬ 
point instructions), trap model, coprocessor in¬ 
terface, and other architectural aspects. 

P1754 is an architecture, not an implementation. 
It is very important to highlight this statement. 
The standard does not limit you to a 
specific method when designing a device; rather, it 
encourages you to be creative, leaving 
“back doors” open so you can add your own 
extensions to the architecture without violating 
the standard’s definitions. Many of you will ask, 
What is a standard good for if it allows you to 
add your own extensions? Let me answer this 
with an example. 

Say that company XYZ is implementing a 
P1754 CPU and the company’s designers want to 
incorporate a graphics accelerator on the device. 
Now, they can use the implementation-dependent 
extension feature to include additional instructions 
and registers on their device and still 
be compliant with the standard. To make use of 
the extra functionality, the application executes a 
call to the operating system, which enables the 
extensions, executes the requested function, and 
then switches back to a compliant mode. In other 
words, system-dependent libraries, which are 
runtime linked to applications, can exercise the 
extensions totally invisible to user software. 

To further emphasize the flexibility and thor¬ 
oughness of the architecture, let me point out the 
register window definition. An implementation 
may have from two to 32 sets of windows with 
each set containing 24 registers (of which eight 
are shared with the next window, thus only 16 
Figure 1. Register windows. 


physical registers per window). It also 
has a set of eight global registers that 
are always active, regardless of which 
window is selected. This approach 
means that an implementation may 
have anywhere from 40 (two sets + 
eight global) to 520 (32 sets + eight glo¬ 
bal) registers. 
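
These totals follow from counting 16 physical registers per window set plus the eight globals. A minimal check in Python (the function name is illustrative):

```python
GLOBAL_REGISTERS = 8
PHYSICAL_PER_WINDOW = 16   # 24 visible per window, 8 of them shared

def total_registers(window_sets):
    """Physical register count for an implementation with the given
    number of window sets (two to 32 are allowed by the text above)."""
    if not 2 <= window_sets <= 32:
        raise ValueError("window_sets must be between 2 and 32")
    return PHYSICAL_PER_WINDOW * window_sets + GLOBAL_REGISTERS

print(total_registers(2), total_registers(32))
```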

Figure 1 demonstrates the circular 
arrangement of a register window with 
six window sets. Two instructions ro¬ 
tate the window: Save and Restore. 
They manipulate the current window 
pointer (CWP) to open a window and 
save the current one (when calling a 
subroutine) or to close the current win¬ 
dow and restore the previous window 
(when returning from a subroutine). If 
too many levels of Save or Restore oc¬ 
cur, the system generates an overflow 
or underflow trap and allows the sys¬ 
tem software to save or restore the reg¬ 
ister file to/from memory. 
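
The rotation and trap behavior can be sketched as follows. This is a simplified Python model, not the standard's definition: it assumes that Save decrements the CWP and Restore increments it (as in existing Sparc implementations), that a single window is marked invalid, and that a trap is represented by an exception. The class and exception names are hypothetical.

```python
class WindowOverflow(Exception):
    """Raised when Save would enter a window marked invalid."""

class WindowUnderflow(Exception):
    """Raised when Restore would enter a window marked invalid."""

class WindowPointer:
    def __init__(self, nwindows=6):
        self.nwindows = nwindows
        self.cwp = 0
        self.invalid_mask = 1 << 1   # one bit per window; window 1 invalid

    def save(self):
        """Open a new window (subroutine call)."""
        new = (self.cwp - 1) % self.nwindows
        if self.invalid_mask & (1 << new):
            raise WindowOverflow(new)   # system software would spill here
        self.cwp = new

    def restore(self):
        """Return to the previous window (subroutine return)."""
        new = (self.cwp + 1) % self.nwindows
        if self.invalid_mask & (1 << new):
            raise WindowUnderflow(new)  # system software would refill here
        self.cwp = new

wp = WindowPointer()
depth = 0
try:
    while True:
        wp.save()
        depth += 1
except WindowOverflow:
    pass   # depth now holds the number of calls that fit without a trap
```

With six window sets and one marked invalid, four nested calls succeed from the starting window before the fifth Save traps in this model.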


Eight registers always overlap be¬ 
tween two windows, allowing a 
straightforward communication be¬ 
tween subroutines. A program want¬ 
ing to pass arguments to a subroutine 
will put the data in its current out reg¬ 
isters and then perform a Save and a 
Call to the subroutine. The subroutine 
acknowledges the arguments now in 
its in registers. Upon return, the sub¬ 
routine may modify its in registers, 
which become the out registers of the 
calling procedure. 
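
The overlap amounts to two windows naming the same physical registers. The sketch below models this with a flat list; the particular physical layout (16 registers per window, outs aliased to the ins of the next lower-numbered window) is an assumption chosen for illustration.

```python
NWINDOWS = 6
phys = [0] * (16 * NWINDOWS)   # windowed physical registers only

def ins_index(window, n):
    """Physical index of in register n of the given window."""
    return 16 * (window % NWINDOWS) + n

def outs_index(window, n):
    """Out registers alias the in registers of the window a Save enters."""
    return ins_index(window - 1, n)

caller = 3
callee = caller - 1                      # window selected after a Save

phys[outs_index(caller, 0)] = 42         # caller places an argument
arg = phys[ins_index(callee, 0)]         # callee sees it as an in register
phys[ins_index(callee, 0)] = arg + 1     # callee leaves a return value
result = phys[outs_index(caller, 0)]     # caller reads it after the return
```

No data is copied at any point; the argument and the return value travel through registers that both procedures can address.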

Since P1754 defines a 32-bit archi¬ 
tecture, the number of register windows 
is limited by the size of the window 
invalid mask (WIM) register. This reg¬ 
ister is 32 bits wide with one bit per 
window; it determines window over¬ 
flow and underflow conditions. Other 
unique aspects of this architecture in¬ 
clude the fairly rich set of memory ad¬ 
dressing modes and the support of 


tagged data structures and supporting 
instructions. 

P1754 does not define an interface 
for this architecture, thus leaving it to 
the implementers and allowing them 
to make this decision depending on 
their needs. With an independent in¬ 
terface architecture, the specifications 
may require building devices in many 
different speeds and performance 
grades. For a low-cost implementation, 
a developer may decide to include a 
simple cache and a single 32-bit bus. 
For a high-performance implementa¬ 
tion, a developer may decide to de¬ 
sign a superscalar device with two 
different 64-bit buses for data and in¬ 
structions. Other specifications may re¬ 
quire gallium arsenide technology over 
CMOS, while even others may desig¬ 
nate ECL. With this flexibility, it is al¬ 
most guaranteed that companies will 
be interested in developing multiple-performance 
workstations that can take 
advantage of binary compatibility. See 
Figure 2 and Table 1. 

What does it mean to have an open 
microprocessor architecture standard? 

• Software compatibility. By fol¬ 
lowing a simple set of rules, soft¬ 
ware developers can ensure the 
portability of their software be¬ 
tween all P1754-compliant sys¬ 
tems, be it a high-performance 
superscalar micro or a low-cost 
desktop workstation. This set of 
rules, or Application Binary Standard 
(ABS), outlines the features 
of P1754 that may prevent portability 
(supervisor instructions and 
implementation-dependent extensions) 
and are not recommended 
for portable code. Software developers 
targeting the workstation 
market can now easily shrink¬ 
wrap their software packages and 
sell them in retail stores as other 
developers do for PCs. 

• Ease of implementation. Developers 
of P1754 implementations 
can focus all their energy on im¬ 
portant issues, such as develop¬ 
ing better technologies, higher 
integration, better interfaces, and 
more powerful memory models 
and cache subsystems. Hardware 
engineers can devote their time to 
developing the internal data path, 
ALUs, and other logic blocks, with¬ 
out worrying about defining the 
architecture. 

Other implementation chal¬ 
lenges may include caches, direct 
DRAM interfaces, high-perfor¬ 
mance floating-point units, and 
unique bus interfaces to enhance 
the performance or to lower the 
overall cost. Packages and power 
dissipation will become more im¬ 
portant as the popularity of this 
standard increases and will help 
system developers to build real- 
world laptop and notebook 
systems. 

• System compatibility. System 



Figure 2. P1754: One standard, multiple implementations, and one-time software 
development. 


Table 1. Offerings of P1754 and other architectures. 

Features                    68000, 88000, i860    P1754 
Interface/bus flexibility   Very limited          Many 
Architectural extensions    None                  Many 
Scalable architecture       No                    Yes 
Number of implementers      One                   Many 
Performance grades          Few*                  Many 

*Mainly different clock speeds 




developers will be able to choose 
between a wide variety of P1754 
implementations, and can select 
the right parts for the overall sys¬ 
tem performance. They won’t 
have to invest large amounts of 
money and manpower to rewrite 
software with the wide variety of 


portable applications available. Using 
open bus architectures like 
SBus or Futurebus will even al¬ 
low developers to swap periph¬ 
erals and extensions among all 
workstations. System developers 
can now enter the workstation 
market and compete with others 
based purely on the performance 
of their workstations and not on 
the number of available software 
packages. 

Look back at the model of the IBM 
PC. It took several years until the stan¬ 
dard emerged. Can’t we see many more 
advantages in a widely accepted standard? 
Having a strong base of supporters 
for P1754 can make it “the” standard 
for the 90s. We are already seeing the 
emergence of open system buses like 
the SBus, MBus, Futurebus+, and oth¬ 
ers, which will contribute to the devel¬ 
opment of open systems and evolve 
into a large base of extensions for users. 

In June 1991 JTC1-SC26 (the Joint 
Technical Committee on Information 
Technology) approved sponsorship of 
a working group to make P1754 an 
international standard. Balloting for 
P1754 occurred in September 1991; 
we’ll have a standard by the end of 
1991 or early 1992. 

Rudolf Usselmann, a senior architect 
at Sparc International in Menlo Park, 
California, develops extensions to the 
Sparc architecture and related products 
such as the SBus and MBus and per¬ 
forms compliance testing on Sparc pro¬ 
cessors and peripherals. He has worked 
for Lattice Semiconductor Corporation, 
LSI Logic Corporation, and Sun 
Microsystems, and has owned and op¬ 
erated his own computer systems com¬ 
pany in Germany. 

Usselmann holds a Diploma in com¬ 
puter science and electrical engineer¬ 
ing from the University of Hamburg. 
Fraud on the Copyright Office 


Judge Hatter had computer program 
copyright owners worried 
for a while after he held all of 
Ashton-Tate’s dBase copyrights invalid 
because Ashton-Tate had committed 
fraud on the United States Copyright 
Office. (See Micro Law, June 1991.) 
Since his December 1990 ruling, how¬ 
ever, he changed his mind and decided 
that maybe Ashton-Tate wasn’t so 
fraudulent after all. The dBase copy¬ 
rights have therefore been resurrected, 
for the time being at least. 

Nonetheless, his original ruling 
serves as a warning that it is risky to 
be careless about what you put on 
applications filed in the Copyright Of¬ 
fice for registration of copyrights and 
mask works. Ashton-Tate’s problem 
concerned its failure to disclose what 
parts of the work it submitted for reg¬ 
istration came from prior sources. That 
probably remains the most likely source 
of a fraud problem, and should be con¬ 
sidered first. 

Application forms for registration of 
computer programs (Copyright Office 
form TX) and mask works (Copyright 
Office form MW) require the applicant 
to state whether any of the material 
sought to be registered comes from a 
preexisting source, and if so to iden¬ 
tify it. The form for computer programs 
directs the applicant to “identify any 
preexisting work or works that this 
work is based on or incorporates” 
(Space 6a). The form for mask works 
merely directs the applicant to “describe 
the new, original contribution in the 
mask work” (Space 8). But it mentions 
that mask works generally contain old 
material that is not protectible in itself, 
and directs the applicant to point out 
the new, protectible material furnish¬ 
ing the basis of the claim of mask work 
rights. 

It has been a common, careless practice to leave Space 6a of the
copyright form blank and to write in Space 8 of the mask work
form (and in a similar space in the copyright form) some phrase
such as “entire work.” But often, that is not true, or at least
it somewhat fudges the truth.

For example, in one now-pending case involving a chip design, the
chip incorporated a large module from a prior chip that the
applicant had marketed; the applicant failed to mention that
fact. When the applicant sued a competitor, the competitor
claimed fraud on the Copyright Office. It may well be that the
Copyright Office would have issued a registration on the chip, if
full disclosure had been made in the first place, since there may
have been protectible material in the part of the later chip not
taken from the earlier chip. Yet, the information about the
earlier chip was omitted in the face of an express directive to
distinguish the new and old parts of the chip. What is worse,
when the owner of the mask work sued its competitor, it took the
position that the whole chip had been copied and infringed, even
though a major fraction of it was the same as the earlier chip
(which was in the public domain).

Whether an applicant that does that should lose all rights or
just be slapped on the wrist is something the courts will have to
decide. But if you are the applicant, why get involved in this
kind of mess? The cost-benefit ratio is unfavorable.

A check list of what to be careful about in filling out Copyright
Office registration forms should include several other items that
can later cause trouble, if incorrectly stated:

• Name of author. The author of a computer program is the person
who wrote the code, unless that person was an employee. A
consultant or independent contractor who wrote the code is not an
employee, and is therefore the author. What is more, unless a
nonemployee author signs a written assignment of the copyright in
favor of the client, the author owns the copyright. Further, if
someone else writes the code, none of the following makes you the
author: thinking up the basic idea of the program, creating its
algorithms, or determining what functions or tasks it will
perform.

• Year of creation, date of first publication. It is easy to get
these dates wrong, particularly when the program is written
outside the United States. Language problems frequently result in
mistakes. When there are several versions or iterations of a
program, confusion may occur over which one should be the basis
of the registration. The same thing occurs with chips. Yet, even
an innocent error about a date is likely to lead to an
infringer’s claiming that the applicant committed fraud on the
Copyright Office in procuring the registration. I was involved in
one case where that led to a blatant infringer’s staving off a
preliminary injunction for a year, because the mistake created
doubt about the copyright owner’s case. The result was nearly
complete destruction of the copyright owner’s business.

Much, or even most, of the time charges of fraud on the Copyright
Office are just a smoke screen that infringers create to hide
their piracy. But some of the time, applicants really do try to
hide the facts to deceive the Copyright Office and to dupe the
public into paying tribute where it is undeserved. Sometimes, the
tactic succeeds in slowing down competitors or scaring them off.

Which category the Ashton-Tate case falls into is something we
will have to discover later. Whatever the answer is, however, it
is clear that you do not want to be dragged into such a
controversy if you can avoid it. It therefore pays to take care
to write down correct answers to the questions on Copyright
Office forms.

A glimpse of the future 




What follows is a potpourri of recent items that were of interest
to me, sometimes because of their relation to a report I wrote in
the past, or because they provide a glimpse of the future. I plan
to follow up on a few of these in more detail, but some readers
might find these previews useful. I apologize for the random
nature of these short notes.


Sales: LCDs, workstations 

LCD sales are booming; Sharp, Sanyo, Matsushita, Toshiba, and
others project very large sales boosts. Color liquid crystal TVs
and projectors are expected to reach 1.8 million and 130,000
units respectively. LCDs are also being developed by Alps
Electric, NEC, Mitsubishi, Asahi Glass, and so on. This key
technology will play an important role in future computers. And,
of course, Japan produces the best (largest and brightest) LCDs.
Also, consumer electronics technology feeds into computers, and
firms known in the West for their consumer electronics (Sony,
Sharp, Seiko) are also heavily into computer activities.

Seiko plans an LCD research and development center in Italy this
year with 20 to 30 researchers. Seiko produces 10-inch PC LCDs.
Omron also has an LCD with 320 x 160 dots, a contrast ratio of
8:1, and view angles of 45 degrees in all directions. It plans to
produce an eight-color panel. Matsuzaki Vacuum plans a
$75-million US plant designed to produce 100,000 ten-inch panels
each month.

In 1990, engineering workstation sales jumped 53 percent to about
98,000 units. Of these, Sun Microsystems produces about 25
percent, Sony 15 percent, and NEC 10 percent. X-window terminals
are estimated at about 5,000 units. By the end of 1991, another
30 percent increase is projected, with X terminals doubling to
about 10,000 units.

Toshiba has revised upward by 100 percent its sales goals for
laptop workstations. The new LCD production plant allows Toshiba
to produce about 1,000 units/month.

Fujitsu plans a complete remodel of some of its workstation
series to be compatible with Sun.
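
The workstation figures quoted in this section (about 98,000 units in 1990 after a 53 percent jump, vendor shares of 25, 15, and 10 percent, and a projected 30 percent increase for 1991) can be turned into approximate unit counts. The sketch below is only a back-of-the-envelope illustration; the function names are hypothetical, and the results inherit the rounding of the "about" figures in the text.

```python
# Rough arithmetic on the workstation figures quoted in the text.
# All function and constant names here are illustrative assumptions.

UNITS_1990 = 98_000        # "about 98,000 units" sold in 1990
GROWTH_1990_PCT = 53       # 1990 sales "jumped 53 percent"
GROWTH_1991_PCT = 30       # "another 30 percent increase is projected"
SHARE_PCT = {"Sun Microsystems": 25, "Sony": 15, "NEC": 10}


def vendor_units(total, share_pct):
    """Approximate units per vendor from percentage shares."""
    return {vendor: total * pct // 100 for vendor, pct in share_pct.items()}


def implied_previous_year(total, growth_pct):
    """Units implied for the prior year, given this year's growth."""
    return round(total * 100 / (100 + growth_pct))


def projected_next_year(total, growth_pct):
    """Units projected for the following year at the given growth."""
    return total * (100 + growth_pct) // 100


if __name__ == "__main__":
    for vendor, units in vendor_units(UNITS_1990, SHARE_PCT).items():
        print(f"{vendor}: about {units:,} units")
    print("Implied 1989 total:",
          implied_previous_year(UNITS_1990, GROWTH_1990_PCT))
    print("Projected 1991 total:",
          projected_next_year(UNITS_1990, GROWTH_1991_PCT))
```

The three named vendors account for only half of the 1990 total, so roughly 49,000 units came from suppliers the text does not break out.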


Microcomputers 

A 1990 edition of Techno Japan contains two 
lengthy articles surveying microcomputers that 
were designed and built entirely in Japan. 1 It 
describes in detail products from NEC, Hitachi, 
TRON, and so on. 

One of the most interesting remarks in the 
article is a claim that Japan has not caught up 
with the US in this technology. The only area in 
which the authors claim that Japan is on a par 
with the US is in DRAM production. 

One article states, 

Given then that the Japanese industry is 
leading in the manufacture of DRAMs, 
one should note that the most important 
technology in the field is not that of 
DRAMs but rather of MPUs. The world 
market of MPUs is now dominated by 
Motorola and Intel, and all Japanese computer 
manufacturers employ these chips 
in their machines. Furthermore, US manufacturers 
have amassed a considerable 
amount of MPU-related software upon 
which Japanese manufacturers rely 
heavily. In addition, license fees must be 
paid to US firms by Japanese companies 
manufacturing semiconductor products, 

continued on p. 79 

Computers/systems 


Desktop performs at 21 Specmarks 

The Tempest-compliant Sparcstation 2GX graphics workstation
targets RISC/Unix government users and contractors requiring
secure hardware capabilities. Along with 16 Mbytes of memory and
accelerated 2D graphics, the desktop features a key-lockable
front door for controlled access. A 40-MHz CPU offers
21-Specmark, 28.5-MIPS, and 4.2-Mflops performance ratings.

The system is binary compatible with the company’s Sparc-based
computers. The standard Tempest system includes a 19-inch color
monitor, 8-bit 2D/3D graphics (using the GX graphics
accelerator), a 3.5-inch DOS-compatible floppy disk drive,
keyboard, and mouse. Sun Microsystems, Inc.; from $29,995.

Reader Service No. 10 

Carry a pen-based notepad 

Field professionals and technicians can print directly on a
digitized screen with a cordless pen to input data for processing
and analysis when using the NCR 3125. The 3.9-lb. notepad
recognizes upper- and lowercase block print letters and is said
to adapt to multiple handwriting styles.

The 386 SL-based NCR 3125 supports up to 16 Kbytes of cache
memory and MS-DOS, Microsoft Windows for Pen Computing, and Go
Corporation’s PenPoint operating systems. Standard 4-Mbyte memory
can be expanded by users to 20 Mbytes in 2- and 4-Mbyte
increments of DRAM or EEPROM in SIMMs or IC cards. NCR
Corporation; $4,765 including charger, carrying case, and stylus.

Reader Service No. 12 

Color laptop features the 486 

The 16.8-lb. Pro Speed 486SX/C laptop computer contains a
10-inch, Super VGA, thin-film transistor, active-matrix color
screen designed for high-end users. Support for 236 colors, 640 x
480 resolution, 120-Mbyte hard drive, 2-Mbyte standard memory
(expandable to 20 Mbytes), and a built-in 32-bit EISA slot help
the Pro Speed provide optimal expansion for networking, imaging,
and engineering applications. The laptop runs Windows 3.0 and
MS-DOS 5.0 as operating environments. NEC Technologies, Inc.;
$8,999.

Reader Service No. 11 



NEC Pro Speed 486SX/C laptop 


Turnkey storage system 

A storage server designed for RISC System/6000 workstations runs
visualization, simulation, and high-speed data acquisition and
analysis applications. With up to forty 5.75-in. ESDI hard-disk
drives working together, the RS/6000 RAID storage server delivers
18 Mbytes/s sustained data transfer rates and up to 43.2 Gbytes
of data storage. Maximum Strategy; from $112,750.

Reader Service No. 13 

Portable offers five internal 
expansion slots 

For added functionality, a 20-MHz 386SX portable computer offers
mobile users five uncommitted, full-size ISA slots. The P.A.C.
SX-20C includes a 16-Mbyte system DRAM, 64-Kbyte cache SRAM, and
a standard red gas plasma display with 16 shades of gray scale at
640 x 480 VGA resolution. A standard P.A.C. SX-20C configuration
also includes a 2-Mbyte RAM; 40-Mbyte hard disk drive; and
1.44-Mbyte, 3.5-in. floppy disk drive.

For a full-color, CRT type of display, the company offers an
optional active matrix thin-film transistor, LCD flat panel with
24,389 colors. Dolch Computer Systems; from $5,995.

Reader Service No. 14 



Dolch Computer Systems P.A.C SX-20C 


Slide-in components 

Super Gen incorporates copper/nickel-coated plastic, plug-in
memory, processor, video cartridges, slide-in drives, and
slide-on modules in a 486 system designed for limited-experience
users. The workstation/server family runs on the CTOS operating
system, which supports the Presentation Manager graphical user
interface and DOS Protected Mode Interface. Unisys Corporation;
from $2,995.

Reader Service No. 15 


Communications 


Access mainframes from the 
Macintosh 

An enhanced version of the Macirmalan LAN allows Macintosh
systems to communicate with IBM mainframes. Macirmalan combines
Distributed Functional Terminal, IEEE Std 802.2, and Synchronous
Data Link Control software. With Macirmalan, users select either
remote or local connections to the mainframe for up to 128
concurrent users on AppleTalk LANs using LocalTalk, Token Ring,
or Ethernet media.

The DFT gateway component distributes up to 20 host sessions
across the LAN. The 802.2 gateway provides users in the same
building with local access to the mainframe over a Token Ring
backbone LAN to support up to 128 host communications sessions.
The SDLC component supports a remote line to the mainframe.
Digital Communications Associates; from $1,495.

Reader Service No. 16 

System 7 connections 

Apple Macintosh users can access various host systems with the
mxConnect software via asynchronous, VIP synchronous, VIP server,
TCP/IP, LAT, X.25, or Communications Toolbox protocols. The
System 7.0 mxConnect includes support for Kermit, Xmodem,
Ymodem, Zmodem, non-protocol, mxFTF, and Communications Toolbox
file transfer protocols. The package requires 1 Mbyte of memory.
Cambridge Computer Corporation; $295.

Reader Service No. 17 

Token Ring adapters 

Five 8- or 16-bit Token Ring adapters ease connections in IBM PC
XTs, ATs, or PS/2 Micro Channel computers. In addition, the
company offers four boot ROMs to load remote programs on Novell-
and NetBIOS-based LANs. Tiara Computer Systems.

Reader Service No. 18 


Modem supports MAP LANs 

Manufacturing Automation Protocol users can interconnect machines
in factory environments with a baseband (no carrier frequency)
modem called the MHW11005. Based on the MC68194 carrierband
modem, the MHW11005 operates at a 5-Mbps data rate, uses
frequency shift keying, and can be inserted into a motherboard.
Operating in temperatures from 0°C to +70°C, the PCB module
supports equipment interconnected via coaxial cable for digital
information exchange purposes. Motorola; $250 (500s).

Reader Service No. 19 

Transfer data over 5,000 
meters 

The Roadrunner series of IEEE-488 bus extenders transfers data
over 5,000 meters using a proprietary data transmission protocol.
Models 4889A and 4889B transmit at 300 Kbytes/s and offer error
detection and correction capabilities. According to the company,
the nondegrading data rate does not appreciably affect system
response time.

Each 115-VAC extender offers standard coaxial and fiber-optical
interfaces. ICS Electronics; from $1,695, 2 weeks ARO.

Reader Service No. 20 

Communication software 
helps novices, experts 

Pere Line 3.0 is a commercial communications program with Windows
and memory-swapping features that let experts explore numerous
configuration operations. In contrast, novices may let Pere Line
3.0 work automatically once they have used the SAA-style pop-up
menus.

The package supports 45 modems and ISDN adapters, enables another
application to run while files are being transferred or
downloaded, and operates on PC/AT/XT or PS/2 compatibles. Pere
Line Data Systems; $49.95 plus $6 shipping and handling.

Reader Service No. 21 

PC fax modem 

An internal modem for PCs combines a 2,400-bps data modem and
9,600-bps Quick Link II send/receive fax capabilities. The
PM2400FX96 supports the Hayes AT command set and lets users
receive a fax while using the computer for other purposes. Users
can send multiple files with one phone call and send a fax
directly from the DOS prompt when using the modem. The PM2400FX96
comes with a five-year performance warranty that covers parts,
labor, and factory repair or replacement. Practical Peripherals,
Inc.; $209.

Reader Service No. 22 



Practical Peripherals fax modem 


Transceivers reduce costs 

A 2.5-ounce transceiver with a two-year warranty reduces Ethernet
cabling costs by replacing AUI drop cables with inexpensive
unshielded twisted-pair cables. Status indicators on the AT-110T
transceiver constantly monitor DTE power and valid links, while a
loop-back function emulates coaxial media to determine whether a
media access unit is connected. Allied Telesis Inc.; $79.95
(volume discounts).

Reader Service No. 23 


DSP hardware/software 


VMEbus DSP moves data at 
200 Mbytes/s 

The DSP-3A multiprocessor board includes three connected, 40-MHz
TMS320C30A devices to support fast parallel or pipelined DSP
operations. A set of graphical object-oriented programming tools
available for the DSP-3A speeds application development, and a
graphical software tool lets tasks be wired together pictorially
on a Sun or X-Window workstation. The DSP-3A fits in a 6U slot of
a standard VMEbus card cage. PC/M Inc.; $17,779 each; 45 days
ARO.

Reader Service No. 24 



The PC/M DSP-3A board 


DSP for the PC 

PC Data Master 3.0 is a signal processing system for 512-Kbyte
IBM PCs and compatibles. The system combines graphics,
data-sampling, test data generation, and data file math routines
with DSP and general-purpose utilities. Users can integrate their
own data analysis and graphics functions using various language
compilers or MS-DOS 3.0 assemblers.

PC Data Master runs best with a hard disk, 1.4-Mbyte on-line disk
space, and 640-Kbyte RAM. Durham Technical Images; $185 plus $95
for an academic site license.

Reader Service No. 25 


System controls seven SCSIs 

Designed for use in PC AT-bus computers, the MP3210 system uses a
DSP3210 32-bit, floating-point chip and a DT-Connect interface to
provide digital and audio I/O and real-time video signal
transfer. Capable of controlling seven SCSI devices, the MP3210
speeds multiprocessing, multitasking, and applications
development tasks. The system’s DSP56ADC16 sigma-delta
oversampling converters offer analog I/O with a signal-to-noise
ratio of 90 dB. Two 16-bit analog channels provide concurrent
outputs at sample rates up to 50 kHz per channel and include
digital oversampling reconstruction filters. Ariel Corporation;
$4,995.

Reader Service No. 26 

Sampling converters 

Two 12-bit sampling monolithic converters plug into most ADC574
sockets without system modifications. With switched capacitor
array CMOS structures, the ADS574 and ADS774 feature internal
sampling and one +5V power supply operation. Users can control
the sampling function to eliminate the need for external
sample/holds in most designs.

ADS574’s maximum throughput time for 12-bit conversion is 25 µs,
including acquisition; ADS774 converts and acquires data in
8.5 µs. Both converters come in 0.3- or 0.6-inch-wide, 28-pin
plastic or side-brazed hermetic ceramic DIPs, 28-pin SOICs, and
in die form. Burr-Brown Corporation; from $14.15 (OEMs, 100s).

Reader Service No. 27 
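
A throughput time translates directly into a maximum sampling rate. The sketch below is a rough illustration, assuming the times quoted for the two converters are 25 and 8.5 microseconds per conversion including acquisition; the function name is hypothetical, and the results are upper bounds rather than data-sheet figures.

```python
# Convert a per-sample throughput time into a maximum sample rate.
# Assumes the quoted times are microseconds per complete conversion.

def max_sample_rate_hz(throughput_time_us):
    """Conversions per second for a given time per conversion (in µs)."""
    return 1_000_000 / throughput_time_us


if __name__ == "__main__":
    for part, time_us in (("ADS574", 25.0), ("ADS774", 8.5)):
        rate_khz = max_sample_rate_hz(time_us) / 1000
        print(f"{part}: {time_us} µs per sample, about {rate_khz:.1f} kHz")
```

Under these assumptions the slower part tops out at 40 kHz and the faster one at a little under 118 kHz.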

Embedded antialiasing 
filters 

The DT3831 series of 12-bit, 50-kHz or 250-kHz, one-slot
AT-compatible data acquisition boards includes embedded
antialiasing filters and the company’s Real-Time Error Prevention
circuit. Front-ending each analog input with a 4-pole Butterworth
filter on one board reduces noise inherent in alternative
two-board solutions while maintaining accuracy within the
filter’s passband. The RTEP circuit adds on-the-fly offset
calibration of channel, range, and gain value combinations.

Each board contains a driver, toolkit, and a Gallery
demonstration package. Data Translation; from $3,695 each.

Reader Service No. 28 

Data Translation DT3831 data boards 


SBus TMS320 board 

An SBus board built around the TMS320C30 DSP provides a
development platform and real-time data acquisition capability
for the Sparcstation. With 512 Kwords of SRAM and dual-port RAM,
the board adds speed, arithmetic commands, and address modes for
real-time I/O to Sparc’s central processor performance. The
company’s 16-bit DSP-Link expansion bus permits the SBus board to
be connected to standard interfaces and provides communication
between SBus C30 boards. Spectrum Signal Processing Inc.; $4,595
each; optional analog I/O modules, $995 each.

Reader Service No. 29 

Integrated PC system 

A DSP development and data acquisition system contains the BN3000
32-bit, floating-point DSP processor board, and the BN3216
two-channel, 16-bit analog interface module.

Together, the board and module provide an integrated 33-Mflops
signal-processing system in a PC. Alone, the BN3000 system
analyzes signals, performs high-speed computation in a PC, and
can be expanded with a 32-bit parallel I/O expansion bus through
two TMS320C30 serial ports. Bridgenorth Signal Processing, Inc.;
$2,995 (BN3000), $995 (BN3216).

Reader Service No. 30 



Bridgenorth Signal Processing system 


A/D components 


RS485 drivers draw 200 µA 

CMOS/Schottky communications devices called the LTC486 and LTC487
feature 28-ns driver propagation delays with 5-ns skew. Both ICs
incorporate a thermal shut-down protection feature and come in
16-pin molded DIP and 16-lead, wide-body SOIC packages. Each IC
supports data-transmission rates up to 10 Mbps, operates from one
5V supply, and draws 200 µA of supply current. Linear Technology
Corporation; $3.70 (samples, 100s).

Reader Service No. 31 

SRAMs in ZIPs 

Two JEDEC, 2-Mbit fast SRAMs in zigzag in-line packages promise
15- and 20-ns access times. The MCM3264 contains eight MCM6209
64K x 4 fast static RAMs, while the MCM8256 uses eight MCM6207
256K x 1 fast SRAMs. Motorola MOS Memory Products Division; from
$195 (100s).

Reader Service No. 32 

Laser printer controller 

High-volume users, OEMs, and VARs can use the LC-8000 laser
printer controller to produce text and graphics output for
CAD/CAM, law, insurance, and finance applications. An input
buffer allows information to be fed into the printer at 2
Mbytes/s; separate sections of the controller draw characters,
manage a font library, erase the page image, and print. Text and
graphics can be output at the speed of a 20- or 50-ppm print
engine. Advanced Technologies International; $3,750, volume
discounts available.

Reader Service No. 33 

VGA notebook controller 

The 160-pin, quad flat pack CL-GD6410 controller lets a VGA
subsystem be implemented with five chips that occupy 4 square
inches of space. Features include 64 gray shades on a monochrome
LCD, a 512-color active-matrix LCD, and simultaneous operation of
the LCD panel and an analog CRT monitor. Also required to create
the VGA/CRT subsystem are two DRAMs for video memory, a clock
synthesizer, and a 64K x 4 frame accelerator memory, if a
dual-panel LCD is used. Cirrus Logic, Inc.; $50 each (samples).

Reader Service No. 34 

Four-Mbyte floppy-disk 
controller 

The FDC37C65C+ floppy-disk controller incorporates support for
4-Mbyte floppy-disk drives and most drive formats. The “+”
designation represents the addition of a 16-byte FIFO and
vertical recording format mode to the company’s FDC37C65C.
Standard Microsystems Corporation; $6.31 (50s).

Reader Service No. 35 

EPROM adapters available 

Users wishing to replace an older IC with a more-available type
can upgrade boards with two EPROM adapters. The adapters contain
Correct-A-Chip technology and rerouting circuitry to switch pins.

A 28-pin EPROM adapter, the 2564, switches pins 1 to 28, 2 to 23,
20 to 22, 22 to 27, 23 to 20, and 27 to 28. A second 28-pin
adapter, the 2532-2732A, switches pins 18 to 21 and 21 to 18.
Aries Electronics, Inc.; $10.68 (500s).

Reader Service No. 36 

Op amps improved 

According to the company, its LT1124 dual and LT1125 quad
operational amplifiers outperform the OP27, the OP270 dual, and
OP470 quad op amps they were designed to replace. For example,
slew rate improves by a factor of 1.6 to 3.7, and the parts offer
greater bandwidth, lower bias and offset currents, and higher
gain than the OP27. Applications include instrumentation
amplifiers, active filters, low-noise signal processing,
microvolt accuracy threshold detection, and infrared detectors.
Linear Technology Corporation; from $3.65 (100s), from $7.05
(military temperature grades).

Reader Service No. 37 


Boards/test equipment 

Is there a 500-user PC in your 
future? 

An intelligent I/O controller supports high-speed, Unix-based
multiuser systems with up to 512 ports when used with four host
cards. One RIO host card supports up to 128 ports at sustained
throughput levels of 57.6-Kbps full duplex.

According to the company, RIO transmits and receives data on all
512 ports simultaneously with no degradation. RIO’s Inmos
T225-based remote terminal adapters link to a host, such as a 386
or 486 PC with Unix or a RISC-based Unix workstation, through
four 10-Mbps data channels, which are connected to the adapters
placed up to 250 feet from the host. Specialix Inc.; $795
(cards), $800 (adapters).

Reader Service No. 38 

EISA backplanes with bus 
mastering 

Two passive-bus EISA connector boards feature a bus master
connector as an upward migration path from 8- or 16-bit ISA
cards. The eight-slot QC1118 and the 12-slot QC1119 each contain
one slot with a proprietary EISAEX connector to control up to six
EISA bus master slots. Both boards provide IBM PC AT keyboard
interface, speaker driver, and external reset input functions.
Designed for OEMs and system integrators, the backplanes mount in
most cabinets that accommodate an AT motherboard. MNC
International; from $295 (evaluation units).

Reader Service No. 39 



MNC International EISA boards 


100-Mflops i860 accelerator 

Developed around the 64-bit Intel i860/XP, the DASH!860/50
application accelerator offers 100-Mflops and 50-MIPS performance
when installed into a 486/386 or 286 PC with 1 Mbyte of memory.
The 50-MHz add-in uses an expansion interface to transfer data at
160 Mbytes/s and supports vectorizers, math libraries, graphics,
and neural network packages. A memory manager supports
zero-wait-state, pipelined memory access; the board’s 8-Mbyte
DRAM can be expanded to 32 Mbytes. Myriad Solutions Ltd.

Reader Service No. 40 

Single-boards offer versatility 

The i386 single-board computer was designed around the Intel
80386DX CPU to run at 25 or 33 MHz and features a 32- or
128-Kbyte cache with up to 32 Mbytes of 70-ns SIMM DRAM. The i386
also incorporates either an Intel 80387DX coprocessor or a Weitek
3167 numeric data processor.

A second computer, the i486, contains an internal floating-point
unit, an 8-Kbyte internal cache, and either the 25- or 33-MHz
Intel 80486DX combined with the 496 Super Cache.

Both boards include a hardware/software programmable watch-dog
timer; a built-in, extended set-up program for programmable I/O
and bus speeds; and buffered drivers capable of accommodating up
to 19 additional expansion cards. IBus PC Technologies.

Reader Service No. 41 



IBus PC Technologies boards 


Imager works without frame 
grabber 

The EDC-1000HR digitizes images into 754 x 488 eight-bit pixels
without the need for a frame grabber or third-party hardware or
software. The digitally controlled/digital output, monochrome
camera works with IBM PCs, XTs, ATs, or equivalents through its
PC-controlled interface card. Users can operate the asynchronous
camera in either interlaced or noninterlaced modes, saving TIFF
and PCX file images for further image processing and desktop
publishing packages. The EDC-1000HR uses a frame-transfer CCD
image sensor configured into 244 lines with 754 elements in each
line. Electrim Corporation; $850.

Reader Service No. 42 

Solve automation problems 
with 32-switch card 

An 8-bit computer board lets PCs select and control 32 analog or
digital signals for use in laboratories, factory automation,
environmental control systems, and security systems. Since there
is no address limit to the number of boards that can co-reside in
one PC, this board can control the connection of multiple signals
through as many switch cards as a system has slots available. The
32-Switch Reed Relay Card contains reed relays that are
transparent to the signals they are controlling. According to the
company, signals up to 100 volts and 10 watts are unaffected
while controlled by the computer. Accusys, Inc.; $395.

Reader Service No. 43 



In-circuit emulator 

System designers using 16- and 32-bit microprocessors can monitor
system elements with the Mice-V real-time in-circuit emulators,
stopping at any point to view the system for bugs. When
supporting 486SX or 487SX applications, the Mice-V-486, 486SX,
and 487SX emulators provide an isolation mode so users can remove
the 486 chip from the probe socket and connect logic analyzer
clips to the pod to gather timing information. Microtek
International, Inc.; from $29,500; 4 weeks ARO; updates
available.

Reader Service No. 44 

Laser test source 

The S793 dual laser test source for single-mode fiber optic
networks combines 1,300-nm and 1,550-nm lasers into a hand-held
package. Lasers in mounts couple directly with FC or ST
connectors to reduce the size and cost of pigtailed lasers. When
used with a typical fiber optic power meter, the S795 lets users
test cable loss up to 45 dB with CW output. Fotec Inc.; $3,250.

Reader Service No. 45 


Power meter tests laser source 

The M247 optical power meter lets factory and field users test
the laser diode source used in erasable optical disk drives or
laser printers. The tester uses a separate 5-mm-diameter silicon
photodiode that is calibrated to measure optical power at 780- to
850-nm wavelengths with up to 20 mW of power. A 9-volt alkaline
battery provides 100 hours of operation. Fotec Inc.; price
depends on detector configuration.

Reader Service No. 46 


Reader Interest Survey 

Indicate your interest in this department 
by circling the appropriate number on 
the Reader Service Card. 

Low 189 Medium 190 High 191 


Call for Papers 


IEEE Micro invites authors to submit 
papers for three special issues: 

• Associative memories and processors in April 1992 

Submit manuscripts by 11/15/91 

• ICs for HDTV in October 1992 

Submit manuscripts by 3/15/92 

• Signal processing in December 1992 

Submit manuscripts by 5/15/92 


Submit six copies of papers to: 

Editor-in-Chief Dante Del Corso 
Dipartimento di Elettronica 
Politecnico di Torino 
C.so Duca degli Abruzzi, 24 
10129 Torino, Italy 
e-mail: delcorso@polito.it 

or 

Associate EIC Ashis Khan 
Mips Computer Systems, Inc. 
950 DeGuigne Dr. 
Sunnyvale, CA 94086 
phone (408) 524-7171 
e-mail: ashis@mips.com 



For author guidelines, contact Krista Tague, 
IEEE Computer Society West Coast Office, 
(714) 821-8380, or fax (714) 821-4010. 

Product Summary 

Joe Hootman 

University of North Dakota 


Displays 

Manufacturer: Qume Corporation 
Model: QM848, 857, 870 monitors 
Comments: Noninterlaced color monitors support graphics
applications, windowing software, and GUIs. The 14-inch QM848 and
857 are fully noninterlaced at resolutions from 640 x 480 to
1,024 x 768 and feature a 0.28-mm dot pitch. The QM870 features a
17-inch flat screen with noninterlaced display up to 1,280 x
1,024 pixels and a 0.26-mm dot pitch. From $799. 
R.S.#: 80 

Manufacturer: Telex Communications 
Model: Magna Byte 6001 
Comments: Full-color LCD computer projection panel features
Macintosh and IBM capabilities equipped with VGA output. The 640
x 480-resolution rear-projection panel offers 30-ms or 50-ms
pixel response time for creation of smooth animation without
smeared images. Users may select 8, 64, or 2,197 RGB color modes
with a 60 to 1 minimum contrast ratio. $5,995. 
R.S.#: 81 

Systems 

Manufacturer: AST Research Inc. 
Model: Premium Exec option 
Comments: Two options for notebook computer line include a
9,600-bps data/fax send-and-receive modem and a second RS232
serial port adapter. Premium Exec currently offers one serial,
one parallel, and one external keyboard/mouse port. $499
(modem), $99 (adapter). 
R.S.#: 82 

Manufacturer: Digital Vision 
Model: Computer Eyes/RT 
Comments: Color video frame grabber for PC-compatible computers
supports multimedia, animation, remote image telecommunicating,
and industrial machine vision applications. The board includes
software with drop-down menus for 1/30th-second grabs and 512 x
512 resolution at 24 bits per pixel in 16 million colors.
$599.95. 
R.S.#: 83 

Manufacturer: Gespac 
Model: CVICLN1728T 
Comments: Line of linear-scan CCD camera, interfaces, and
software includes a 1,728-pixel resolution and produces a 6-bit
digital output for 64 gray levels. A 2-Mbyte/s signal output
permits the camera to be placed 100 meters from the
microcomputer. Two cameras can use a CVICLN-G2 board to interface
to a G-64 system running OS-9 or MS-DOS. $1,655 (camera), $995
(interface). 
R.S.#: 84 

Software 

Manufacturer: Adobe Systems 
Model: Photoshop V. 2.0 
Comments: Apple Macintosh image-processing package includes
enhanced color and black-and-white image editing, direct CMYK
editing, editing of selected image areas, and importing of
Illustrator-compatible EPS files. Requires an SE or II with 2
Mbytes of RAM, hard-disk drive, and System 6.04 or later
software. Also recommended is a 68020 CPU, color monitor, and 4
or more Mbytes of RAM. $199 (upgrades). 
R.S.#: 85 

Manufacturer: Applied Microsystems 
Model: Code Tap 386SX 
Comments: Runtime development tool debugs software programs
running on Intel 386SX microprocessors and monitors and controls
execution in the target without using target memory or I/O, or
requiring prior code modification. The tool chain consists of a
target access probe, an RS232 communications adapter, windowed
source-level debugger, Validate/Softscope III, and Pharlap
ASM/Linkloc. $5,995. 
R.S.#: 86 

Manufacturer: Circuit Search 
Model: Database, V. 1.09 
Comments: This dBase III/dBase III+-compatible database contains
references to 13,000 articles and papers from 300 technical and
scientific journals and magazines. Users locate circuits by
keywords after installing the database on an IBM PC-compatible
hard disk and using in conjunction with either the supplied
menu-driven front end or a dBase-compatible database management
program. $375 (includes one free semiannual update). 
R.S.#: 87 

Inset Systems 

National Instruments 

Miscellany 

Cubit 

Martech 

National Instruments 

Specialized Systems Consultants 


Hijaak 2.02 Graphics utilities include a Super VGA-screen capture program 88 

for Windows 3.0 that saves up to 256 colors simultaneously. 

Once saved to disk, Hijaak converts the image to supported raster 
formats, fax card formats, or Post Script. Also available are a 5- 
Kbyte TSR for DOS and a conversion program for 36 graphics file 
formats. Free or $50 (upgrades), $199 (all others). 

Lab Driver for Dynamic link library operates under real, standard, and enhanced 89 

Windows modes of the Microsoft Windows 3.0 operating system. The library 

controls the company’s plug-in data acquisition boards for XT/AT, 

PS/2, and EISA PCs with programs written in programming 
languages that support Windows’ DLLs. Low-level functions for 
analog, digital, and timing I/O and high-level functions for stream- 
to-disk and waveform generation programs control the boards. 


Model 7050 Color graphics CRT controller communicates on an 8-bit or 16-bit 90 

controller STD bus to offload most of the graphics-handling load from the 

main system CPU. The 7050 supports up to three screens with 640 
x 480 resolution, includes 4 Mbytes of video RAM, and is compat¬ 
ible with VGA, EGA, and monochrome monitors. $490 each. 


Workstation Memory boards provide an alternative to HP Apollo 9000 series 91 

RAMs 700 workstation RAMs. The 16-Mbyte and 32-Mbyte module 

memory boards are available from stock for next-day delivery. 

$3,360 (16 Mbytes), $6,720 (32 Mbytes). 


PC-LPM-16 Surface-mount version helps users develop portable data acqui- 92 

I/O board sition systems using laptop computers. The 4.35-inch x 3.9-inch 

board supports applications of signal and transient analysis, data 
logging, switching external devices, reading the status of external 
digital logic, synchronizing events, generating pauses, and 
measuring frequency and time. 


Revised reference card reflects changes in the finalization of the 93 
ANSI C ANSI C standard established by the X3J11 committee. The eight- 

reference card page card combines examples and specifications to illustrate C 
and explains statement formats, functions, constants, and pre¬ 
processor commands by example. $3. 
Simulating a visual function


continued from p. 11 

network capable of handling images with up to 64 x 64 reso¬ 
lution as illustrated in Figure 2b. 

Experimental results 

We performed experiments on a Sun Sparcstation 1 with 
the Sun OS 4.0.3 operating system, conducting SPICE simula¬ 
tions (XSPICE 3) with 2D test patterns and images. 

Testing the PE. Our first test involved verification of the
design of the modifiable thresholding unit. We used five randomly
chosen T's of image intensity to simulate possible situations in
motion detection. With each given T, we then determined the mapping
voltage by Theorem 1. Using this voltage, we specified the W/L ratio
by using Theorem 2. Next, we constructed a thresholding unit and
simulated the processing element. Table 1 lists a set of the
simulation results, with T, $V_{in}$, and the W/L ratio. Figure 3
illustrates the corresponding set of VTC curves, which match the
design specifications. For example, when a gray-level threshold
equals T = 127, the corresponding voltage is $V_{in}$ = 2.5 V, which
is then realized in hardware by setting $(W_n/L_n)/(W_p/L_p)$ equal
to 1.
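
Theorems 1 and 2 are not reproduced in this part of the article, so the following is only a rough consistency check on Table 1 rather than the authors' derivation. It assumes a linear map of the 0-255 gray range onto 0-5 V and the switching-point relation of an idealized CMOS inverter, ((VDD - V_in)/V_in)^2. Both formulas and all function names are assumptions made for illustration; they happen to agree with the tabulated values to about three decimal places.

```python
VDD = 5.0        # supply voltage in volts (the 0-5 V input range)
GRAY_MAX = 255   # highest 8-bit gray level


def gray_to_voltage(threshold):
    """Assumed linear map from a gray-level threshold to a voltage."""
    return VDD * threshold / GRAY_MAX


def wl_ratio(v_in):
    """Assumed (W/L)_n : (W/L)_p ratio for a switching point at v_in."""
    return ((VDD - v_in) / v_in) ** 2


# Voltage thresholds exactly as listed in Table 1
table_voltages = {100: 1.96, 127: 2.50, 163: 3.20, 179: 3.50}
ratios = {t: wl_ratio(v) for t, v in table_voltages.items()}

for t, v in table_voltages.items():
    print(t, round(gray_to_voltage(t), 2), v, round(ratios[t], 4))
```

Under these assumptions a threshold above mid-scale calls for a ratio below 1, and a threshold below mid-scale for a ratio above 1, which is the trend Table 1 shows.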

Testing the 2D network. We constructed a small two- 
dimensional network with 6x6 PEs to form a 2D array pro¬ 
cessor. Each PE has two input pixels from test patterns 1 and 
2 at the same ( x,y ) location. We randomly shifted each pat¬ 
tern to produce a time-variant image. To simulate a real situ¬ 
ation of dynamic image processing, we added random noise 
with a maximum 10-pixel variation 
to the test patterns and fed the test 
patterns into the network as two con¬ 
secutive image frames. Each PE in the 
network operates concurrently, first 
performing differencing operations 
then thresholding. The processed 
outputs are collected at the output 
node of the network for comparison 
to the numerical computation. The 
experimental result illustrated in Fig¬ 
ure 4 shows that the network behaves 
well under simulated noise condi¬ 
tions. One can see from Figure 4b 
that the network has captured the 
nonstationary region. 
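
The numerical computation used for comparison can be pictured with a short sketch. It assumes the simplest reading of "differencing then thresholding", namely an absolute difference between two frames followed by a fixed threshold; the exact PE operation is defined earlier in the article and may differ. The function name and the 3 x 3 frames are hypothetical.

```python
def difference_picture(frame1, frame2, threshold):
    """Return 1 where |frame2 - frame1| exceeds the threshold, else 0."""
    return [
        [1 if abs(b - a) > threshold else 0 for a, b in zip(row1, row2)]
        for row1, row2 in zip(frame1, frame2)
    ]


# A bright column (value 200) moves one pixel to the left between frames.
frame1 = [[10, 10, 200], [10, 10, 200], [10, 10, 10]]
frame2 = [[10, 200, 10], [10, 200, 10], [10, 10, 10]]
moved = difference_picture(frame1, frame2, threshold=100)
```

Each output element depends only on the two input pixels at the same location, which is why every PE can work concurrently.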

To test the 2D network with 64 x 
64 analog PEs, we wrote a program 
in C to generate a SPICE program of 
the network with about 100,000 com¬ 
ponents. A block diagram given in 
Figure 5 illustrates the procedure. 
Note that all the processing results 
are compared to numerical compu¬ 
tation. 3,4 
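
The authors wrote their generator in C and do not list it. As a hedged illustration of the idea only, the Python sketch below emits one subcircuit instance line per PE. The subcircuit name PE, the node naming scheme, and the function name are hypothetical, and the .subckt definition itself is omitted, so the output is not a complete simulation deck.

```python
def generate_netlist(rows, cols, subckt="PE"):
    """Build SPICE-style netlist text with one subcircuit instance per PE."""
    lines = [f"* {rows} x {cols} network of {subckt} instances"]
    for r in range(rows):
        for c in range(cols):
            # Hypothetical nodes: two input pixels and one output per PE
            lines.append(
                f"X{r}_{c} in1_{r}_{c} in2_{r}_{c} out_{r}_{c} {subckt}"
            )
    lines.append(".end")
    return "\n".join(lines)


netlist = generate_netlist(64, 64)
instance_count = sum(
    1 for line in netlist.splitlines() if line.startswith("X")
)
```

A 64 x 64 array has 4,096 PEs, so a total of about 100,000 components corresponds to roughly 24 components per PE.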

For the next stage of testing we 
created synthetic test images. The first 
image frame contains two objects: a 
square and a triangle. The second 
image contains the same objects with 
the square moved to a new position. 
We selected a gray-level threshold
T = 179 and calculated the W/L ratio,
which is equal to 0.1837, for each PE.
Figure 6 on page 46 gives the processing
result. Note that $D_1(x,y,t)$ corresponds
to the triangular regions in
Figure 6a and 6b, while D 2 (x,y, t) pro- 


Table 1. Relationship of image intensity,
voltage threshold, and W/L ratio.

Threshold T in     Voltage                  SPICE         $(W/L)_n : (W/L)_p$
image intensity    threshold ($V_{in}$)     simulation    ratio

100                1.96                     Matched       2.4056
127                2.50                     Matched       1.0000
163                3.20                     Matched       0.3164
179                3.50                     Matched       0.1837


[Figure 3 axis label: Steps*]

* Input voltage range 0-5 V with 50 steps, which maps to 256 gray levels


Figure 3. SPICE simulation result of the thresholding unit with different $V_{in}$.
Note that we designed each VTC curve with a different gray-level threshold,
which is implemented by defining different W/L ratios.
[Figure 4b, SPICE output node voltages:]

v(1) = 1.340129e-07    v(2) = 1.340129e-07    v(3) = 1.340129e-07
v(4) = 4.999944e+00    v(5) = 4.999944e+00    v(6) = 4.999944e+00
v(7) = 1.340129e-07    v(8) = 1.340129e-07    v(9) = 1.340129e-07
v(10) = 1.340129e-07   v(11) = 1.340129e-07   v(12) = 4.999944e+00
v(13) = 1.340129e-07   v(14) = 4.999944e+00   v(15) = 4.999944e+00
v(16) = 1.340129e-07   v(17) = 1.340129e-07   v(18) = 4.999944e+00
v(19) = 1.340129e-07   v(20) = 1.340129e-07   v(21) = 4.999944e+00
v(22) = 4.999944e+00   v(23) = 1.340129e-07   v(24) = 4.999944e+00
v(25) = 1.340129e-07   v(26) = 1.340129e-07   v(27) = 1.340129e-07
v(28) = 4.999944e+00   v(29) = 4.999944e+00   v(30) = 1.340129e-07
v(31) = 1.340129e-07   v(32) = 1.340129e-07   v(33) = 1.340129e-07
v(34) = 1.340129e-07   v(35) = 1.340129e-07   v(36) = 1.340129e-07

Figure 4. Two 6 x 6 test patterns, each being randomly
shifted, with random noise added (a); and the SPICE simulation
result (b).


[Figure 5 block diagram, in sequence:]

1. Generating a SPICE program by using a program written in C
2. Performing SPICE simulation of a 2D analog network for
   motion detection
3. Verifying the simulation result by comparing to the numerical
   computation
4. Displaying the SPICE simulation result (DP picture)

Figure 5. Block diagram showing the procedure for generating
and testing a 64 x 64-PE network. DP indicates the
difference picture or D(x,y,t).


duced the reversed-L region in Figure 6c. We compared this 
image to the numerical computation and found that they 
matched. 

Finally, using a DT2852 image grabber, we digitized a se¬ 
quence of tool images from a laboratory scene. The images 
have 256 x 240-pixel resolution and 8-bit gray levels. We 
rearranged the tools on the scene and arbitrarily chose the 
two consecutive image frames shown in Figure 7 on page 46. 
We used the 64 x 64-resolution window shown in Figure 7a 
to circumscribe portions of the images for testing. The threshold
value of image intensity is T = 163, and the corresponding
voltage threshold is $V_{in}$ = 3.20 V. As a result, the processed
image $D(x,y,t)$ given in Figure 7b shows the brighter region
corresponding to $D_1(x,y,t)$ and the grayer region corresponding
to $D_2(x,y,t)$. This result correctly picks up the
nonstationary regions.



Figure 6. Testing the network: With two objects, a triangle and a square (a); with a square moved to a new position (b);
and the computation result collected from the output nodes of the network (c). The nonstationary region $D_1(x,y,t)$
appears on the left, and $D_2(x,y,t)$ on the right.




Figure 7. Laboratory images with tools displaced during acquisition: A 64 x 64-resolution window circumscribes the por¬ 
tion to be processed (a); and two test patterns show the different positions of the tool (b). The output of the processing 
result from the network is displayed, clearly indicating the displacement. 


Our 2D ANALOG VLSI NETWORK simulates a function
of motion perception in the human visual peripheral process.
We mapped the image-processing problem to analog
computing with a mathematical formulation and proved theorems
to assist network design. Finally, we performed SPICE
simulations and experiments with real laboratory images to
verify the design. The network processes images in real time,
matching its performance to the theoretical analysis result.
Future research on incorporating high-level visual processing
is under investigation.
Signal analysis


continued from p. 15 

Noninvasive conduction-velocity estimation. Muscle 
fiber conduction velocity is the propagation velocity of the 
depolarization along the membrane of muscle fibers. Con¬ 
duction velocity is a basic physiological parameter related to 


the type and diameter of the muscle fiber, to the pH of inter¬ 
stitial and intracellular fluids, to ion concentrations in differ¬ 
ent compartments, and to motor unit firing rate or stimulation 
frequency. (See the box for techniques used to estimate con¬ 
duction velocity.) 

Conduction velocity affects the power spectrum of the 
myoelectric signal, 1 and its variation contributes to spectral 
compression during the fatigue process. 1,2 


Estimation of muscle-fiber conduction velocity 


Among several other techniques, we describe those that 
are important for historical reasons or because of their clini¬ 
cal relevance. 

Method based on spectral dips. The spectral-dips tech¬ 
nique can be used with the bipolar detection technique 
(see the earlier box on signal detection). The myoelectric 
signal is filtered by the transfer function of the differential 
electrode. We can describe the squared modulus of such a 
transfer function with the following equation: 1 

$$|H(j\omega)|^2 = K \sin^2\left(\frac{\omega d}{2v}\right) \qquad (1)$$

where $|H(j\omega)|$ is the modulus of the transfer function of
the differential electrode, K is a constant, $\omega$ is the radian
frequency, d is the interelectrode distance, and v is the
muscle-fiber conduction velocity.

Equation 1 shows that the modulus of the transfer function
of the differential probe equals zero when $\omega d/v$ equals
$2K\pi$. In other words, for frequencies satisfying Equation 2
below, the power spectral density function of the signal
equals zero:

$$f = K\,\frac{v}{d} \qquad (2)$$

where K = ±1, ±2, ±3, .... (Here K denotes an integer, not the
constant of Equation 1.)

Consequently, we might estimate conduction velocity 
by measuring the frequency at which the first spectral dip 
occurs. In practice, however, this method is rarely applied. 
Under typical conditions, the first spectral dip cannot be 
detected because it occurs in a frequency band in which 
the power associated to the myoelectric signal is close to 
zero. Also, when the myoelectric signal generated during 
voluntary contractions is studied, the power spectrum esti¬ 
mate is affected by random errors that are generally diffi¬ 
cult to reduce to a level allowing for reliable and accurate 
detection of spectral dips. 
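
A small numerical sketch makes the difficulty concrete. The zero condition, that the radian frequency times d/v equals an integer multiple of 2 pi, gives dip frequencies f = K v/d. The velocity of 4 m/s and the interelectrode distance of 10 mm below are illustrative assumptions, not values from this article, and the function names are hypothetical.

```python
def spectral_dip_frequencies(velocity_m_s, distance_m, count=3):
    """Dip frequencies f = K * v / d for K = 1..count, in hertz."""
    return [k * velocity_m_s / distance_m for k in range(1, count + 1)]


def velocity_from_first_dip(first_dip_hz, distance_m):
    """Invert the dip relation with K = 1: v = f * d."""
    return first_dip_hz * distance_m


# Illustrative values only: v = 4 m/s, d = 10 mm
dips = spectral_dip_frequencies(4.0, 0.010)
estimate = velocity_from_first_dip(dips[0], 0.010)
```

With these assumed values the first dip falls near 400 Hz, which illustrates how it can land in a band where little signal power remains.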

Cross-correlation. Another technique is to use the cross-correlation
function to estimate the time delay between
two myoelectric signals collected at two different sites along
the active muscle fibers (see Figure D). If we assume the
system is space-invariant within the detection area of the
electrode, the two signals that result are time shifted by the
interval $\Delta\tau$, as given by the following equation:

$$\Delta\tau = \frac{d}{v} \qquad (3)$$

where $\Delta\tau$ is the time delay between the two signals, d is
the interelectrode distance, and v is the muscle fiber conduction
velocity.

Figure D. Schematic representation of the cross-correlation
technique: physiological model (1) and cross-correlation
function between signals X and Y (2).
Moreover, since conduction velocity relates to the size and 
histochemical properties of muscle fibers, observing it dur¬ 
ing different muscle activities may help in studying the re¬ 
cruitment strategy of motor units, 3 and possibly in 
characterizing muscle architecture and fiber typing. 23 

SMES processing techniques. During voluntary muscle 
contractions, the myoelectric signal contains information related
to both the nervous system (the firing rate interactions
of each motor unit) and the muscle tissue. When studying 
muscle fatigue, we need to separate the two contributions, 
especially the manner in which they influence the power- 
density spectrum of the myoelectric signal. Moreover, muscle 
fiber conduction velocity strongly influences the surface myo¬ 
electric signal power spectrum. 14 But, it may also be affected 
by changes in the motor unit firing rate. 

Electrical stimulation for muscle characterization. To char¬ 
acterize the muscle without any influence from the central 
nervous system, we can produce muscle contraction by using
proper devices to stimulate either nerve bundles or muscle
motor points. A motor point is that area of skin above a 
muscle that shows the highest sensitivity to electrical stimuli 
applied with surface electrodes. During electrical stimulation, 
the firing rate of motor units can be kept constant, thus sepa¬ 


rating the contribution of central and localized muscle fa¬ 
tigue. Figure 2 on the next page shows some examples of the 
time course of spectral parameters and conduction velocity 
during voluntary and electrically elicited contractions of the 
human tibialis anterior. 

The observed time course of spectral parameters and con¬ 
duction velocity obtained during electrically elicited and vol¬ 
untary contractions shows that stimulated contractions yield 
smoother behavior of the parameters. A comprehensive com¬ 
parison of the two techniques and a discussion of their po¬ 
tentialities is available elsewhere. 23 

Myoelectric signal processing. Most information currently 
obtained from SMES analysis comes from the study of the 
time course of time and frequency parameters of the myo¬ 
electric signal generated during voluntary or electrically elic¬ 
ited contractions. Here we introduce some technical 
considerations in the computation of these parameters. 

Earlier, we briefly presented the mathematical properties 
of SMES and pointed out that the myoelectric signal obtained 
during voluntary isometric contractions at constant force may 
be considered as a band-limited stochastic process with a 
Gaussian distribution of amplitudes. 1 In most cases, during 
relatively low-level and short contractions, the myoelectric 


The cross-correlation technique relies on an impor¬ 
tant property of the cross-correlation function: If two 
signals are equal but time-delayed with respect to each 
other, the peak of their cross-correlation function has a 
value equal to one, and its localization on the abscissa 
equals the delay between the two signals. To estimate 
the conduction velocity of a group of muscle fibers, we 
need an electrode that collects two signals at two differ¬ 
ent sites along the muscle. The two signals are generally 
acquired by a computer, which estimates the cross¬ 
correlation function between them and calculates the 
abscissa of its maximum value. Given the relationship 
presented in Equation 3, if we know the interelectrode 
distance, we can obtain the value of the conduction 
velocity. 

To obtain an acceptable resolution in the conduction- 
velocity estimation, we must compute the time delay 
between the two signals with a resolution of ±10 to 20
µs. Such a resolution can be obtained by either
oversampling the signals or interpolating the cross¬ 
correlation function. Oversampling is the less efficient 
solution: It is burdensome computationally and requires 
the storage and manipulation of long sequences of data. 
Interpolation techniques increase the time resolution by 
interpolating the cross-correlation with a suitable function. 
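
As a minimal sketch of the interpolation route, the code below estimates the delay between two signals from the peak of their cross-correlation and refines it with a three-point parabolic fit. The parabola is one possible "suitable function", chosen here for brevity and not stated in the article. The Gaussian test pulses, the 1-kHz sampling rate, the 10-mm interelectrode distance, and all function names are assumptions.

```python
import math


def cross_correlation(x, y, max_lag):
    """Unnormalized cross-correlation r[lag] = sum_i x[i] * y[i + lag]."""
    n = min(len(x), len(y))
    return [
        sum(x[i] * y[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def estimate_delay(x, y, fs, max_lag):
    """Delay of y relative to x in seconds, refined by a parabolic fit."""
    r = cross_correlation(x, y, max_lag)
    k = max(range(len(r)), key=lambda i: r[i])
    offset = 0.0
    if 0 < k < len(r) - 1:
        denom = r[k - 1] - 2.0 * r[k] + r[k + 1]
        if denom != 0.0:
            # Vertex of the parabola through the peak and its two neighbors
            offset = 0.5 * (r[k - 1] - r[k + 1]) / denom
    return (k + offset) / fs


def pulse(center, n=128, sigma=3.0):
    """Synthetic Gaussian pulse used as a stand-in for a real signal."""
    return [
        math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(n)
    ]


fs = 1000.0   # assumed sampling frequency in hertz
d = 0.010     # assumed interelectrode distance in meters
x = pulse(50.0)
y = pulse(52.3)   # same pulse delayed by 2.3 samples (2.3 ms)
delay = estimate_delay(x, y, fs, max_lag=10)
velocity = d / delay
```

Without the parabolic step the estimate would be limited to whole sampling periods (1 ms here), far coarser than the resolution quoted above.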


Spectral matching. The spectral matching technique
is similar to the cross-correlation approach, but it is based
on the minimization of the $L_2$ distance between the Fourier
transforms of the delayed signals. Aligning two signals
shifted in time by minimizing the mean-square
difference between their amplitudes is equivalent to minimizing
the mean-square error between their Fourier transforms. 2
Generally, to achieve an acceptable resolution,
the alignment must be performed using fractions of the
sampling period. This is easy in the frequency domain,
because one signal can be time-shifted by any time interval
$\tau$ simply by multiplying its Fourier transform by
the complex number $e^{j\omega\tau}$.

Computationally, spectral matching is less expensive
than the cross-correlation technique. CPU time may be
reduced by a factor of 10.
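
The frequency-domain alignment can be sketched as follows. This is an illustration under stated assumptions, not the published algorithm: it uses a plain O(N^2) discrete Fourier transform and a brute-force grid search over candidate shifts, whereas a practical implementation would use an FFT and an iterative minimizer. The sign of the exponent follows the convention that a delay multiplies the transform by exp(-j omega tau). The test pulses and function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Plain O(N^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * k * i / n) for i in range(n))
        for k in range(n)
    ]


def mismatch(X, Y, shift):
    """Squared error between Y and X delayed by `shift` samples."""
    n = len(X)
    total = 0.0
    for k in range(n):
        if 2 * k == n:
            continue  # skip the Nyquist bin, where the shift is ambiguous
        signed_k = k if 2 * k < n else k - n
        phase = cmath.exp(-2j * cmath.pi * signed_k * shift / n)
        total += abs(X[k] * phase - Y[k]) ** 2
    return total


def spectral_matching_delay(x, y, max_shift, step=0.01):
    """Shift (in samples) that minimizes the spectral mismatch."""
    X, Y = dft(x), dft(y)
    steps = int(round(max_shift / step))
    candidates = [i * step for i in range(steps + 1)]
    return min(candidates, key=lambda s: mismatch(X, Y, s))


def pulse(center, n=64, sigma=3.0):
    """Synthetic Gaussian pulse used as a stand-in for a real signal."""
    return [
        math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(n)
    ]


x = pulse(30.0)
y = pulse(32.3)   # same pulse delayed by 2.3 samples
shift = spectral_matching_delay(x, y, max_shift=5.0)
```

The candidate shifts are fractions of a sample, which is the point of working in the frequency domain: no oversampling or resampling of the time series is needed.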
Figure 2. Examples of the time course of median fre¬ 
quency, mean frequency, and conduction velocity during 
contractions elicited voluntarily (80 percent of maximum 
voluntary contraction) (a) and electrically (b). 

signal may be considered as wide-sense stationary. (A low- 
level contraction is 20 to 30 percent of the maximum volun¬ 
tary contraction force, and a short contraction lasts for some 
tens of seconds.) Evaluation of stationarity is of paramount 
importance when we compute the power spectrum of the 
voluntary myoelectric signal. When we increase the contrac¬ 
tion level, the wide-sense stationary hypothesis generally holds 
during shorter epochs (0.5 to 1.5 seconds at 50 to 80 percent 
of the maximum voluntary contraction force). 

Generally, we compute the time course of time and fre¬ 
quency domain parameters by subdividing the signal into 
epochs during which the wide-sense stationary hypothesis 


holds. We then compute the values of the selected attributes 
over each epoch. 

Different techniques are generally used to estimate the 
power spectrum density functions of voluntary or electrically 
elicited myoelectric signals. To process a voluntary myoelec¬ 
tric signal, the spectral estimation technique most frequently 
adopted relies on the estimation of the power spectrum den¬ 
sity function, using the raw periodogram. After segmenting 
the signal into epochs in which the wide-sense stationary 
assumption holds, we compute the signal power spectrum 
simply by taking the squared modulus of the discrete Fourier 
transform of the signal itself. To increase computational effi¬ 
ciency, the discrete Fourier transform is generally computed 
with radix-2 fast Fourier transform algorithms. 
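
The procedure just described (segment into epochs, transform, take the squared modulus) can be sketched in a few lines. The recursive radix-2 routine below is a textbook decimation-in-time FFT written for clarity rather than speed; the division by the epoch length is one common scaling convention, and the sine-wave test signal and function names are assumptions.

```python
import cmath
import math


def fft_radix2(x):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def split_epochs(signal, epoch_len):
    """Non-overlapping epochs; a trailing partial epoch is dropped."""
    last_start = len(signal) - epoch_len
    return [signal[s:s + epoch_len] for s in range(0, last_start + 1, epoch_len)]


def raw_periodogram(epoch, fs):
    """Squared modulus of the DFT over the non-negative frequencies."""
    n = len(epoch)
    spectrum = fft_radix2(epoch)
    freqs = [k * fs / n for k in range(n // 2 + 1)]
    power = [abs(spectrum[k]) ** 2 / n for k in range(n // 2 + 1)]
    return freqs, power


fs = 1024.0
signal = [math.sin(2 * math.pi * 96.0 * i / fs) for i in range(256)]
epochs = split_epochs(signal, 128)
peaks = []
for epoch in epochs:
    freqs, power = raw_periodogram(epoch, fs)
    peaks.append(freqs[power.index(max(power))])
```

The epoch length fixes the frequency resolution (fs divided by the number of samples, 8 Hz in this example), so the choice of epoch is a trade-off between stationarity and spectral detail.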

In most clinical applications, the power spectrum is esti¬ 
mated principally to extract the time course of spectral vari¬ 
ables—namely, the power spectrum mean and median 
frequencies. Although the raw periodogram is an inconsis¬ 
tent spectral estimator, results obtained in myoelectric signal 
processing are usually satisfactory thanks to the statistical prop¬ 
erties of the mean and median frequencies. 
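
For reference, the two spectral variables can be computed from a sampled power spectrum as in the sketch below. The mean frequency is the power-weighted average of the frequency bins. The median frequency is taken here as the lowest bin at which the cumulative power reaches half the total, a simple discrete definition assumed for illustration. The five-bin spectrum is made up.

```python
def mean_frequency(freqs, power):
    """Power-weighted average frequency."""
    total = sum(power)
    return sum(f * p for f, p in zip(freqs, power)) / total


def median_frequency(freqs, power):
    """Lowest bin frequency where cumulative power reaches half the total."""
    half = sum(power) / 2.0
    running = 0.0
    for f, p in zip(freqs, power):
        running += p
        if running >= half:
            return f
    return freqs[-1]


# Made-up symmetric spectrum centered on 20 Hz
freqs = [0.0, 10.0, 20.0, 30.0, 40.0]
power = [0.0, 1.0, 2.0, 1.0, 0.0]
mnf = mean_frequency(freqs, power)
mdf = median_frequency(freqs, power)
```

Both quantities sum over many bins, so the bin-to-bin scatter of the raw periodogram tends to average out.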

During electrically elicited contractions, the signal is deter¬ 
ministic and quasiperiodic. The stimulation frequency im¬ 
posed by the stimulator determines the period of the signal. 
(We call each electrical response of a muscle to a stimulus 
the M wave.) Since the signal is quasiperiodic, three different 
strategies are frequently used to compute its attributes: 

• Subdivide the signal into epochs during which M waves 
may be considered as time-invariant. Then compute time 
and frequency attributes over each epoch. 

• Compute amplitude attributes and the power spectrum 
over each single M wave. Since each single M wave is 
processed, the computational cost of this approach is 
generally high, and the time series of the computed at¬ 
tributes reflects random variations of the M waves. The 
spectral density of the signal may be easily interpolated 
by zero padding in the time domain to obtain the de¬ 
sired spectral resolution. 

• Subdivide the signal into epochs during which M waves 
may be considered as time-invariant (as in the first 
method). Then compute the average M wave over each 
epoch and the amplitude and spectral parameters on 
the average M wave. If time averaging is performed prop¬ 
erly, this approach combines the advantages of the other 
two methods. 
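
The third strategy can be sketched as follows. The code assumes that sampling is synchronized with the stimulator so that every stimulation period spans the same whole number of samples; that assumption, the function name, and the toy data are not from the article.

```python
def average_m_wave(signal, samples_per_period):
    """Average consecutive M waves, one per stimulation period."""
    count = len(signal) // samples_per_period
    if count == 0:
        raise ValueError("epoch shorter than one stimulation period")
    waves = [
        signal[i * samples_per_period:(i + 1) * samples_per_period]
        for i in range(count)
    ]
    return [
        sum(wave[j] for wave in waves) / count
        for j in range(samples_per_period)
    ]


# Toy epoch holding three responses of three samples each
epoch = [0.0, 1.0, 0.0, 0.0, 3.0, 0.0, 0.0, 2.0, 0.0]
avg = average_m_wave(epoch, 3)
```

Amplitude and spectral parameters are then computed once on the averaged wave instead of once per response, which reduces both the cost and the random scatter.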

Among the time domain parameters are the average recti¬ 
fied value, the root mean-square value, the muscle fiber con¬ 
duction velocity, and the peak and peak-to-peak amplitudes, 
duration, and the area of the M wave. We compute these 
parameters after segmenting the input stream of samples as 
described earlier. 
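
Three of these amplitude parameters have standard definitions that fit in a few lines. The sketch below applies them to one epoch of samples; the function names and the four-sample epoch are illustrative.

```python
import math


def average_rectified_value(samples):
    """Mean of the absolute sample values."""
    return sum(abs(s) for s in samples) / len(samples)


def root_mean_square(samples):
    """Square root of the mean squared sample value."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def peak_to_peak(samples):
    """Difference between the largest and smallest sample."""
    return max(samples) - min(samples)


epoch = [3.0, -4.0, 0.0, 1.0]
arv = average_rectified_value(epoch)
rms = root_mean_square(epoch)
p2p = peak_to_peak(epoch)
```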
Figure 3. Schematic representation of a real-time myoelectric signal processing system. MNF indicates the mean frequency 
and CV the conduction velocity. 


Real-time myoelectric signal processing. Off-line pro¬ 
cessing is well suited for research applications, but for most 
clinical applications on-line systems are better. In fact, if phy¬ 
sicians can observe on line the time course of spectral and 
amplitude parameters during voluntary and electrically elic¬ 
ited contractions, they can promptly detect the occurrence of 
nonstationarities related to physiological events taking place 
inside the muscle. Examples are the onset of muscle fatigue 
or a change in the strategy used by the central nervous sys¬ 
tem to recruit motor units. These events may be used as 
feedback to allow the subject to control the contraction more 
effectively. Moreover, in applications of functional electrical 
stimulation for gait restoration, spectral and amplitude pa¬ 
rameters of the myoelectric signal detected on line may be 


used to control the stimulation strategy to increase muscle 
endurance as much as possible. 

The availability of digital signal processing cards for indus¬ 
try-standard buses has made it possible to develop relatively 
low-cost, high-performance systems for on-line processing 
of biological signals. In this section, we describe a recently 
published system. 

Figure 3 is a block diagram of a real-time myoelectric sig¬ 
nal analyzer consisting of an 80286-based PC with a digital 
signal processing card inserted into a slot. The DSP card, 
from Loughborough Sound Images, Ltd., is based on the Texas 
Instruments TMS320C25 DSP chip. Its principal features are a 
clock frequency of 40 MHz, a 64-Kbyte dual-port memory for 
both data and code, and an on-board 16-bit analog-to-digital 
converter. The host computer and the DSP processor work 
in parallel. 

The host computer 

• manages the user interface so the user can store the 
patient’s general data and supervise the display of the 
results, 

• initializes the processing cycle and downloads the 
TMS320 executable code from the host computer’s hard 
disk to the program memory of the DSP board, and 

• manages the video operations through direct address¬ 
ing of the EGA or VGA card memory to obtain the real¬ 
time display of the results. 

At the same time, the DSP subsystem 

• manages the interrupts generated by the end of conver¬ 
sions coming from the analog-to-digital converter, 

• computes the 128-point fast Fourier transforms and evalu¬ 
ates the short-term power spectrum, 

• evaluates maximum, mean, and median frequencies of 
the power spectrum, 

• smooths the spectral parameters using a four-point mov¬ 
ing average, 

• pseudocolor-codes the spectral power densities, and 

• transfers the results to the host computer for real-time 
display. 
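
The smoothing step in the list above can be illustrated with a short sketch. The article does not say how the four-point window is aligned or how the first few values are handled, so the version below assumes a trailing window that simply uses fewer points at start-up. The sample values are made up.

```python
def moving_average(values, window=4):
    """Average each value with up to window - 1 preceding values."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up mean-frequency values in hertz, one per subepoch
mean_freqs = [100.0, 96.0, 98.0, 94.0, 90.0]
smoothed = moving_average(mean_freqs)
```

A trailing window only needs values that are already available, which suits a system that displays results while acquisition continues.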

In a typical processing cycle, the system is initialized with 
user-supplied parameters, and the host computer downloads 
the desired DSP routine from its hard disk to the DSP card’s 
program memory. The host computer starts the process by 
resetting the TMS320C25 and then waits for the first frame of 
results. When the first frame is available, the two processors 
start working in parallel. The DSP subsystem performs two 
different tasks: 

• It services the interrupts generated by the analog-to- 
digital converter during the current subepoch t(j) when 
converted data are available, storing them in the data 
memory; 

• It processes data stored in the subepoch t(j- 1). 

At the same time, the host computer displays the results 
obtained by processing the data acquired in the subepoch 
t(j- 2). 

The delay of two subepochs (256 ms for a sampling fre¬ 
quency equal to 1 kHz) between data acquisition and graphic 
display of the results is acceptable for clinical applications. 
Although the throughput of the system is slightly higher than 
20 kHz, the user may select lower sampling rates to avoid 
overloading the system. Sampling rates ranging from 1 kHz 
to 2 kHz are generally satisfactory. A synchronization bit 


from a TMS320C25 register synchronizes the two processors. 
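
The timing figures quoted above follow from the 128-point transform: one subepoch holds 128 samples, and the display lags acquisition by two subepochs. The sketch below reproduces that arithmetic and lists which subepoch each stage handles at each step; the function names are hypothetical.

```python
def subepoch_duration(fft_points, fs):
    """Time needed to acquire one subepoch, in seconds."""
    return fft_points / fs


def display_latency(fft_points, fs, stages_behind=2):
    """Delay between acquisition and display of the same subepoch."""
    return stages_behind * subepoch_duration(fft_points, fs)


def pipeline_schedule(steps):
    """For each step j: subepoch acquired, processed, and displayed."""
    return [
        {
            "acquire": j,
            "process": j - 1 if j >= 1 else None,
            "display": j - 2 if j >= 2 else None,
        }
        for j in range(steps)
    ]


latency_1khz = display_latency(128, 1000.0)
latency_2khz = display_latency(128, 2000.0)
schedule = pipeline_schedule(4)
```

At 1 kHz this gives the 256 ms stated in the text, and doubling the sampling rate halves the delay because each subepoch fills twice as fast.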

While the system collects and processes data in the back¬ 
ground, the menu-driven user interface runs in the foreground. 
The user chooses among several options with function keys: 

• Stopping or starting an evaluation session. 

• Displaying in real time the color-coded power spec¬ 
trum density functions and their mean and median fre¬ 
quencies. A freeze option lets the user stop the 
video-display update without stopping data acquisition 
and processing, and a record option saves the values of 
the frozen screen into an ASCII file for future use. 

• Varying the sampling rate. 

• Evaluating and displaying in real time the confidence 
intervals for spectral parameters. 

• Evaluating power spectrum density functions and spec¬ 
tral parameters averaged over subsequent subepochs, 
and displaying the results in three different graphics 
windows. 

• Displaying the myoelectric signal in the time domain to 
verify the signal quality on the host computer’s video 
monitor. 

This system is powerful enough to compute muscle fiber 
conduction velocity on line using the spectral matching tech¬ 
nique described in the earlier box on conduction-velocity 
estimation. It provides sampling frequencies up to 5 kHz 
and collects data from up to four different channels. The 
adjacent box on clinical applications describes some uses of 
such systems. 

Needle electromyography 

Needle electromyography techniques obtain diagnostic and/ 
or prognostic data by analyzing the myoelectric signal. Al¬ 
though most data are obtained from visual inspection of the 
interference pattern, in the last decade several techniques 
that use computers have been introduced into clinical prac¬ 
tice. Physicians have shown appreciable interest in the de¬ 
composition of the interference pattern and macro EMG 
techniques. Several myoelectric signal analysis systems now 
include them as options. 

We briefly describe these two techniques to show typical 
requirements for a computer system used in this field. The 
literature provides more detailed descriptions of the decom¬ 
position problem 1 and the macro EMG technique. 5 

Decomposition of the interference pattern. The myo¬ 
electric signal is the sum of the action potentials produced 
by each particular active motor unit. The decomposition of 
the interference pattern procedure decomposes the myoelec¬ 
tric signal into its constituent MUAP trains. 1 Researchers and 
physicians use this technique to investigate the strategies 
implemented by the neuromuscular system for controlling 
muscle contractions. 
Two clinical uses of SMES analysis are muscle fatigue 
evaluation and noninvasive fiber typing. 

Muscle fatigue evaluation. SMES analysis has been used 
extensively in the study of muscle fatigue. Analysis and 
quantification of muscle fatigue are crucial in sport, occu¬ 
pational, and rehabilitation medicine. Fatigue is a continu¬ 
ous process that begins at the onset of a muscle contraction. 
It has two components: fatigue of the nervous system and 
localized muscle fatigue. The former is represented by the 
variation of the discharge activity of motor units during a 
sustained muscle contraction and is related to processes 
taking place in the nervous system. The latter is related 
only to processes taking place inside muscles. A compre¬ 
hensive review of muscle fatigue is available elsewhere. 1 

Since the beginning of this century, the frequency com¬ 
ponents of the surface myoelectric signal have been known 
to decrease during sustained contractions. Nonetheless, 
physiologists have become accustomed to describing (and 
measuring) muscle fatigue by monitoring the force output 
of the muscle. They usually refer to the onset of fatigue as 
a point in time when the subject cannot sustain the force 
output at a preset value, and measure fatigue as the con¬ 
tinual decrease of force production. 

This approach has drawbacks. It works best when the
muscle activity is controlled artificially, for example, by
electrical stimulation of the nerve or muscle under study.
During voluntary contractions, the sincerity of the subject’s
effort influences the level of force output. Also, during
less-than-maximum-force contractions, a muscle can sustain
a constant-force output for a definite amount of time
before a decrease in force occurs. Nevertheless, during
this interval the force-generation characteristics of the muscle
fibers (mechanical properties and control properties of the
motor units) change as a function of time. These modifications,
which have a strong bearing on the behavior of the
muscle, are excluded from the evaluation of muscle fatigue.

Figure E. Time course of muscle force and mean frequency
of the SMES power density function during an
attempted constant-force contraction. (Plotted curves: force
and mean frequency.)

The contractile measure of muscle fatigue is not always 
useful from a pragmatic point of view. For example, if we 
want to monitor the rate of muscle fatigue during mean¬ 
ingful and productive muscle efforts, a measure of con¬ 
tractile fatigue would be useless because it would only 
indicate the onset of exhaustion experienced by the subject. 

On the other hand, spectral parameters allow the clini¬ 
cian and researcher to assess the evolution of the fatigue 
process from the beginning of the contraction, long before 
mechanical manifestations of muscle fatigue are evident. 
Figure E illustrates this concept. It shows the time course of 
the mean frequency of the SMES power spectrum and of 
the force output of a muscle during a maximal voluntary 
contraction. The power spectrum mean frequency starts 
decreasing from the beginning of the contraction, while the 
muscle fails to produce the desired effort 13 seconds later. 
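
As a minimal sketch of how such a time course can be obtained, the Python fragment below computes the mean frequency of the power spectrum (the power-weighted average of the frequency bins) for consecutive windows of a sampled signal. The function names, the non-overlapping windowing, and the use of a direct DFT instead of an FFT are assumptions made for illustration; they are not taken from any particular fatigue monitor.

```python
import cmath
import math


def power_spectrum(x, fs):
    """One-sided power spectrum of x by a direct DFT (adequate for short windows)."""
    n = len(x)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k * fs / n)      # frequency of bin k in Hz
        power.append(abs(acc) ** 2)   # squared magnitude of bin k
    return freqs, power


def mean_frequency(x, fs):
    """Power-weighted average frequency of the spectrum of x."""
    freqs, power = power_spectrum(x, fs)
    total = sum(power)
    if total == 0:
        return 0.0
    return sum(f * p for f, p in zip(freqs, power)) / total


def mean_frequency_track(signal, fs, window):
    """Mean frequency of each consecutive, non-overlapping window."""
    return [mean_frequency(signal[i:i + window], fs)
            for i in range(0, len(signal) - window + 1, window)]
```

Applied to a record of a sustained contraction, a steadily decreasing sequence returned by `mean_frequency_track` would correspond to the spectral compression described above.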

The myoelectric signal power spectrum is affected by 
the firing rate of motor units, by the cancellation effect 
occurring because of the random superposition of motor 
unit action potentials, and by the shapes of the action po¬ 
tentials. During sustained contractions, all these factors 
change, but there is evidence 1 that the power-spectral 
compression toward lower frequencies results mostly from 
widening of the shapes of the motor unit action potentials. 
By studying the relationships among variations of spectral 
and amplitude parameters and conduction velocity, we can 
obtain much information about processes taking place in¬ 
side the muscle tissue. 

Recently, several dedicated devices have been designed 
for assessing changes of the power spectrum of the sur¬ 
face myoelectric signal due to localized muscle fatigue. 
Among others, the muscle fatigue monitor 3 has been used 
extensively for studying muscle fatigue associated with 
lower back pain, and a similar device based on a digital 
signal processing board has been described by Balestra et 
al. In such applications, computers are necessary to realize 
devices suitable for clinical use. 

Noninvasive fiber typing. We can divide muscle fibers 
into several families that differ in their metabolism, 
mechanical properties, and cross-section and electrophysiological 
characteristics. The distribution of different 
muscle fibers inside a skeletal muscle is related to 
the particular muscle, its activity, and the subject’s train¬ 
ing protocols and age. Neuromuscular diseases can cause 
a variation of the population of fibers inside a muscle, 
thus changing its properties and characteristics. 

Data about muscle fiber constituency help a clinician 
trace the progression of neuromuscular diseases or the 
effectiveness of a training protocol. Although precise in¬ 
formation about fiber distribution inside a muscle is ob¬ 
tained only with a biopsy, qualitative data can be obtained 
by analyzing the time course of SMES parameters and 
conduction velocity during fatiguing contractions. In fact, 
spectral parameters and muscle fiber conduction veloc¬ 
ity relate to the size and type of muscle fibers. 

By analyzing the surface myoelectric signal, a clini¬ 
cian can infer the recruitment order of motor units dur¬ 
ing voluntary and electrically elicited contractions. 
According to Henneman’s size principle, during voluntary 
contractions of increasing force, larger motor units 
(made up of larger muscle fibers) are progressively 
recruited. 

The analysis of the motor unit recruitment order is 
particularly important when studying pathologies known 
to affect it substantially. Also, for applications using func¬ 
tional electrical stimulation (for example, gait restora¬ 
tion in paraplegic patients), the recruitment order is 
important because it affects the endurance of electrically 
stimulated muscles. 

Different algorithms to obtain a reliable decomposition 
procedure have been proposed in the literature. 6-8 For clinical 
purposes, physicians most often consider the MUAP mor¬ 
phological characteristics, while researchers examine both 
the MUAP shape and the motor unit’s firing history to inves¬ 
tigate motor unit properties and the strategies implemented 
by the central nervous system for motor control. 

Figure 4 shows how a typical decomposition algorithm 
works. The input is the interference pattern collected by a 
needle electrode inserted into the muscle to be studied. Such 
input appears as a noiselike process. Here, the procedure 
must recognize the contribution of single motor units, clas¬ 
sify the different MUAP shapes, and extract the firing history 
of each active motor unit. The output of the algorithm is a 
database with the shapes of all the MUAPs detected by the 
system and a structure of their firing histories. 

The accuracy of a decomposition algorithm used in re¬ 
search must be very high. In 1973, Shiavi and Negin demon¬ 
strated that a misclassification of one or two MUAPs out of 
100 may compromise the possibility of observing important 
characteristics of the motor unit behavior. 1 

Statistical analysis of the motor unit firing histories typi¬ 
cally requires the acquisition and processing of signal records 
ranging from 5 to 60 seconds long. Moreover, to obtain an 
acceptable reconstruction of the MUAP shape in the time 
domain, systems oversample the signal at sampling frequen¬ 
cies as high as 50 kHz. To improve recognition and classifi¬ 
cation algorithm performance, several channels are acquired 
simultaneously, thus increasing memory use and processing 
time considerably. 
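
A short worked calculation shows the scale of the problem. The sketch below multiplies out the figures quoted above; the channel count of 4 and the 2 bytes per sample are assumptions chosen only for illustration.

```python
def record_size_bytes(sampling_rate_hz, duration_s, channels, bytes_per_sample=2):
    """Raw storage needed for one multichannel record, before any compression."""
    return sampling_rate_hz * duration_s * channels * bytes_per_sample


# 50 kHz, 60 seconds, 4 channels (assumed), 2 bytes per sample (assumed)
size = record_size_bytes(50_000, 60, 4)
print(size / 1e6, "Mbytes")
```

Under these assumptions a single record occupies 24 Mbytes, which explains why data compression matters in the acquisition phase.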

From a technical point of view, the essential problems are 
data compression and computational speed. Besides, current 
decomposition algorithms generally require a large amount 
of man-machine interaction, so graphics must be advanced 
to facilitate and speed the operator’s work as much as pos¬ 
sible. Until a few years ago, only minicomputers with large 
mass memory were suitable for processing these algorithms. 
Now, powerful PCs and workstations support most decom¬ 
position programs, and accurate and reliable systems are 
available and affordable. Thus, in the near future, EMG 
decomposition will probably become a widely used 
technique. 

The following subsections outline the steps generally in¬ 
volved in a decomposition technique. 

Signal acquisition. Usually the system acquires data from 
more than one channel, and the sampling rate may range 
from 10 kHz to 50 kHz. The duration of the acquisition may 
range from a few seconds to a minute. During acquisition, 
the computer screen displays the signals to allow the opera¬ 
tor to control the quality of the collected signals. This task is 
particularly important because of the small detection volume 
of the needle electrodes that collect the signals. Even a small 
movement of the needle may totally change the pool of motor 
units observed, thus compromising the possibility of assessing 
the firing behavior of a certain pool of motor units over a 
sufficiently long time interval. 

Figure 4. A sketch of the decomposition process. The waveforms reported in 
the output section represent the MUAP shapes of three different motor units 
that were identified by the process. The diagram on the right side is a time 
diagram of the firing history of the corresponding motor unit. Each bar 
corresponds to a firing occurrence. 

The acquisition system consists of an IBM PC AT-class 
machine or a workstation with a suitable analog-to-digital 
converter board and a 19- or 20-inch high-resolution screen. 
The tasks in this phase are data collection and simultaneous 
display. A data-compression algorithm is usually necessary 
to reduce signal redundancy and avoid the storage of too 
much data. Intelligent analog-to-digital converter cards with 
their own microprocessor system and possibly a DSP chip 
can greatly facilitate the on-line processing and display of the 
signals. 

Identification of individual MUAPs. The identification of 
individual MUAPs often requires an improvement of the signal-to-noise 
ratio. Powerful techniques for noise reduction 
are currently available—for example, optimal Wiener filter¬ 
ing, adaptive filtering, and noise reduction through the use 
of high-order spectra. But identification remains a challeng¬ 
ing task, especially if the initial signal-to-noise ratio is relatively 
low. To compress the data and reduce memory 
occupation, a threshold value is set either automatically or 
manually, taking into account the statistical properties of the 
recorded signal. The acquired samples are compared with 
that threshold. Only values over the threshold are saved, 
so memory isn’t wasted recording noise. 
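
The thresholding step can be sketched as follows. This is an illustrative fragment, not the method of any specific system: deriving the automatic threshold as a multiple `k` of the standard deviation of the record, comparing absolute values, and storing (index, value) pairs so that timing is preserved are all assumptions.

```python
import statistics


def auto_threshold(samples, k=3.0):
    """Threshold derived from the record's statistics: k standard deviations (assumed rule)."""
    return k * statistics.pstdev(samples)


def compress(samples, threshold):
    """Keep only samples whose magnitude exceeds the threshold, with their positions."""
    return [(i, s) for i, s in enumerate(samples) if abs(s) > threshold]
```

For example, `compress([0.1, -0.2, 0.1, 5.0, -4.0, 0.0, 0.2], 1.0)` keeps only the two large samples at positions 3 and 4 and discards the low-level background.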

Resolution of superpositions. The 
summation of different MUAPs belong¬ 
ing to motor units that fire simulta¬ 
neously leads to a waveform in which 
distinguishing each single MUAP is gen¬ 
erally difficult. Such a waveform is 
called a superposition. At low contrac¬ 
tion levels (below 20 to 30 percent of 
the maximum voluntary contraction 
force) the MUAP shapes are usually 
well separated, but the likelihood of 
observing superpositions rises with in¬ 
creasing contraction levels. At contrac¬ 
tion levels higher than 60 to 80 percent 
of the maximum voluntary contraction 
force, most MUAP shapes are super¬ 
imposed, making it difficult to identify 
the active motor units. 

The algorithms used to separate the 
complex waveforms into their constitu¬ 
ents are very time-consuming. In fact, 
when distinguishing any previously 
classified MUAP shape is not possible 
and the record under observation can¬ 
not be classified as a new MUAP shape, 
superpositions are typically resolved by 
computing the probability that every motor unit previously 
classified fires at a given instant. After finding the motor unit 
that maximizes the probability of firing at that time, the pro¬ 
gram subtracts the corresponding MUAP shape from the 
record. The same algorithm is recursively applied to the re¬ 
maining signal until no more MUAPs can be identified. The 
program evaluates the result according to a set of rules, and, 
if it is not satisfactory, starts the process again and alerts the 
human operator to control its progression. 
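
The recursive subtraction described above can be sketched in a few lines. Note the substitution: instead of the firing probability used by the decomposition programs, this illustration simply picks, at each step, the template and time offset that remove the most energy from the record. The names, the `min_gain` stopping rule, and the energy criterion are assumptions, not the published algorithms.

```python
def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def subtract(record, template, offset):
    """Return a copy of record with template subtracted starting at offset."""
    out = list(record)
    for i, v in enumerate(template):
        out[offset + i] -= v
    return out


def resolve_superposition(record, templates, min_gain=0.5):
    """Greedy peel-off: repeatedly subtract the best-fitting known MUAP shape.

    templates maps a motor unit name to its MUAP shape (a list of samples).
    Returns the list of (unit, offset) pairs found and the residual signal.
    """
    residual = list(record)
    found = []
    while True:
        best = None
        for name, tpl in templates.items():
            for offset in range(len(residual) - len(tpl) + 1):
                candidate = subtract(residual, tpl, offset)
                gain = energy(residual) - energy(candidate)
                if best is None or gain > best[0]:
                    best = (gain, name, offset, candidate)
        # Stop when no subtraction removes a meaningful share of a template's energy.
        if best is None or best[0] < min_gain * energy(templates[best[1]]):
            return found, residual
        found.append((best[1], best[2]))
        residual = best[3]
```

A residual that is still large after the loop ends would be the cue, in a real system, to restart the process or to ask the operator to intervene.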

MUAP classification. The classification techniques used by 
most decomposition algorithms are based on pattern match¬ 
ing. At the beginning of the analysis of a new signal record, 
both the number of motor units whose firing can be detected 
and their characteristics are unknown. When a new motor 
unit is identified, its shape is recorded and considered as a 
new template. This template is then added to the template 
database and used to classify other MUAPs belonging to the 
same motor unit. Every time the program recognizes a wave¬ 
form as generated by a specific motor unit, it updates the 
corresponding template to keep track of the MUAP modifica¬ 
tions caused by fatigue, small electrode movements, or other 
physiological variations. Artificial intelligence techniques can 
improve both the accuracy and speed of the classification 
process. 
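
The template mechanism can be illustrated with a minimal sketch. The squared-distance measure, the acceptance limit `max_distance`, and the exponential update with `update_rate` are assumptions chosen for brevity; actual decomposition programs use more elaborate matching rules.

```python
def distance(a, b):
    """Squared Euclidean distance between two equally long waveforms."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TemplateClassifier:
    """Assigns each waveform to a motor unit template, creating templates as needed."""

    def __init__(self, max_distance, update_rate=0.2):
        self.templates = []            # one running template per motor unit
        self.max_distance = max_distance
        self.update_rate = update_rate

    def classify(self, waveform):
        best_id, best_d = None, None
        for unit_id, tpl in enumerate(self.templates):
            d = distance(waveform, tpl)
            if best_d is None or d < best_d:
                best_id, best_d = unit_id, d
        if best_id is None or best_d > self.max_distance:
            # No template is close enough: treat the waveform as a new motor unit.
            self.templates.append(list(waveform))
            return len(self.templates) - 1
        # Move the matched template slightly toward the new waveform so it can
        # follow slow shape changes (fatigue, small electrode movements).
        r = self.update_rate
        tpl = self.templates[best_id]
        self.templates[best_id] = [(1 - r) * t + r * w for t, w in zip(tpl, waveform)]
        return best_id
```

The update step is what lets a template track the gradual MUAP modifications mentioned above without being mistaken for a new motor unit.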
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Measuring temporal and morphological information. The 
outcomes of the decomposition process are the characteris¬ 
tics (MUAP shape, firing history, and so on) of each detected 
motor unit. At the end of the process the researchers evalu¬ 
ate these features to extract the physiological information. 

If the purpose is to better understand the control proper¬ 
ties of motor units, researchers are interested mostly in their 
firing histories. A firing history is usually represented as a sequence 
of δ functions (impulses) placed along the time axis. Figure 4 
shows an example of such a plot. The distance between two 
subsequent firing occurrences is called the interpulse inter¬ 
val. Counting the number of firing occurrences of a particular 
motor unit within a given period of time and dividing that 
number by the period of time itself yields the so-called mean 
firing rate, which correlates with physiological and anatomi¬ 
cal characteristics of the muscle fibers that constitute that 
motor unit. 
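
Both quantities follow directly from the list of firing times. The fragment below is a minimal sketch; the function names and the convention that firing times are given in seconds are assumptions.

```python
def interpulse_intervals(firing_times):
    """Time between each firing occurrence and the next one."""
    return [b - a for a, b in zip(firing_times, firing_times[1:])]


def mean_firing_rate(firing_times, t_start, t_end):
    """Number of firings in [t_start, t_end) divided by the length of that period."""
    count = sum(1 for t in firing_times if t_start <= t < t_end)
    return count / (t_end - t_start)
```

With firing times in seconds, `mean_firing_rate` returns pulses per second.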

Typically, physicians can extract information about the 
physiological condition of muscle fibers belonging to a given 
motor unit by observing variables related to the MUAP shape. 
Figure 5 summarizes the main variables related to the MUAP 
shape, which is generally computed by averaging several 
MUAPs belonging to the same motor unit to reduce pulse-to-pulse 
variability. 



Figure 5. MUAP shape characteristics. 


Decomposition using artificial intelligence. In 1988, 
Broman proposed a decomposition program based on AI 
techniques. 9 The program consists of three planes: control, 
module, and data. The data plane contains the raw and pro¬ 
cessed data, as well as data to control the selection of the 
computing modules. The module plane contains the basic 
processing modules that deal with raw data, and the refine¬ 
ment modules applied to processed data. The module plane 
also contains an interface between the data plane and the 
model in the control plane. The control plane contains 

• a priori knowledge about the physiology of the myo¬ 
electric signal and possible artifacts of the recording 
technique, 

• a scheduler that selects the appropriate module for the 
actual processing, and 

• a dynamic model with information about the decompo¬ 
sition process under development. 

This approach significantly increases the computational 
speed and dramatically reduces the need for human interac¬ 
tion (at least with signals without numerous and complicated 
superpositions). Nevertheless, it is still far from being adequate 
for extensive clinical application. 

Macro EMG. Stalberg 5 described the macro EMG technique 
in 1980 to extract potentials belonging to a single motor unit 
for motor unit classification and study. General-purpose needle 
electrodes have a detection volume large enough to collect 
MUAPs generated by muscle fibers belonging to a number of 
different motor units. In pathological conditions, normal and 
pathological motor units can coexist in the same muscle. Thus, 
with general-purpose needles it is generally difficult to ob¬ 
tain information about either motor unit type. 

The macro EMG approach involves an operation that aver¬ 
ages a given number of frames of myoelectric signals col¬ 
lected by the general-purpose needle electrode. The 
single-fiber action potential generated by a fiber belonging 
to the motor unit to be studied triggers the averaging pro¬ 
cess. We adjust the position of the single-fiber needle to se¬ 
lect a stable single-fiber potential to trigger an averager. The 
macro electrode collects signal epochs of some tens of milli¬ 
seconds (typically 60 ms) after each triggering event. These 
are fed to the averager, which returns the macro potential 
corresponding to the fiber potential trigger. If we assume the 
firing histories of the motor units detected by the large needle 
electrode are uncorrelated, and if the number of averaged 
frames is large enough, the averaging process should lessen 
potentials generated by all the motor units except those syn¬ 
chronized with the trigger. 
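
The averaging operation itself is simple, as the following sketch suggests. It assumes the trigger events have already been detected and are given as sample indices; the function name and the handling of epochs that run past the end of the record are illustrative choices.

```python
def triggered_average(signal, trigger_indices, epoch_len):
    """Average the epochs of signal that start at each trigger index."""
    epochs = [signal[i:i + epoch_len]
              for i in trigger_indices
              if i + epoch_len <= len(signal)]   # drop epochs cut off by the record end
    if not epochs:
        return []
    n = len(epochs)
    return [sum(e[k] for e in epochs) / n for k in range(epoch_len)]
```

At a sampling rate of 10 kHz, a 60-ms epoch corresponds to `epoch_len = 600`. Activity that is not time-locked to the trigger tends to cancel as the number of averaged epochs grows, which is the effect the technique relies on.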

By changing the position of the single-fiber needle in the 
muscle, we can trigger the averager with muscle fiber poten¬ 
tials belonging to different motor units. This lets us extract 
consecutively from the background noise the macro MUAPs 
that occur time-locked to each fiber spike train used as a 
trigger source. Thus, we can determine the number of muscle 
fibers that constitute the motor unit, and obtain information 
about the temporal and spatial scattering of muscle fiber de¬ 
polarization inside each motor unit. This technique is very 
useful because it offers information when other techniques 
give insufficient or misleading results. 5 

Systems designed to support macro EMG analysis resemble 
those used for myoelectric signal decomposition. They ac¬ 
quire two signals of interest (concentric and the macro EMG) 
at a sampling rate ranging from 10 kHz to 50 kHz. Epochs as 
long as 100 to 120 seconds are generally studied. 10 

Expert systems 

In the last 10 years, expert systems to help the physician 
diagnose neuromuscular disorders have been developed. From 
those described in the literature, we consider some systems 
to be typical: 

• The Localize system localizes the site of a lesion within 
peripheral nerves using muscle-strength testing (1982). 11 

• Myosis diagnoses mono- and polyneuropathies using the 
myoelectric signal (1985). 12 

• Blinowska and Verroust’s system helps the physician 
diagnose the carpal tunnel syndrome (1987). 13 

• A system described by Gallardo et al. helps in the study 
of plexus and root lesions (1987). 14 

• Kandid assists in the diagnosis of the complete range of 
neuromuscular disorders and supports the physician in 
planning an optimal sequence for testing. 15 

The most complex expert system in the electromyographic 
field today is Munin. This knowledge-based system supports 
an electromyographer in the various steps of an EMG exami¬ 
nation performed with needle electromyographic techniques. 16 

The reasoning performed by Munin is based on a hypothesize-and-test 
methodology. A typical consultation starts with 
the physician entering the patient’s anamnestic data. Using 
this information, the system generates a ranked list of EMG 
tests, ordered according to their expected diagnostic discrimination. 
The electromyographer selects a test, and Munin helps 
carry it out. 

Munin uses the results to perform a two-step reasoning 
process. The first step describes the pathophysiological 
changes of the muscles and nerves involved in the test, while 
the second lists the diseases that may cause the changes rec¬ 
ognized by the first step. To each disease the system assigns 
a probability score. At this point, the physician may either 
select a new test or accept the diagnostic results. If a new test 
is performed, Munin updates the list of the possible diseases 
and their probabilities. The consultation stops when a diag¬ 
nosis is reached, when no more tests are available, or when 
the disadvantages of performing new tests are greater than 
the expected diagnostic benefits. 

Modern EMG SIGNAL ANALYSIS STARTED in the early 
1960s, when the first computer systems became available to 
researchers. Only recently has the availability of micropro¬ 
cessor-based personal computers and workstations allowed 
clinicians to apply the signal processing techniques previ¬ 
ously developed in the field of basic research. Here, we re¬ 
viewed the most recent applications of computers to the 
analysis of the EMG signal, specifically describing several tech¬ 
niques developed in the last two decades. We tried to focus 
our attention on the procedures that are presently part of 
standard clinical practice; however, the development of new 
techniques is ongoing and the field of applied research con¬ 
tinuously expanding. 

We are currently able to extract only a small amount of the 
information on the neuromuscular system that the EMG sig¬ 
nal embodies. The challenge for the future is to increase as 
much as possible our ability to understand the neuromuscu¬ 
lar system by continuing to analyze the electrical activity of 
muscles and nerves. 

must match the exported definitions of the server module. 
Figure 4 shows a client that calls a server procedure when 
both modules are directly linked. The shaded key form sym¬ 
bolizes the interface. 

Figure 4. Interface of a called procedure: imported and exported interfaces (procedures and data types). 

Remote procedure calls 

A remote procedure call looks exactly like a local procedure 
call to another module of the same task. For instance: 

Bank.GetItem ("SMITH", Salary); 

This call is issued by a client program to a server procedure 
Bank.GetItem. The client is suspended until the call returns. 
That is exactly how the familiar local procedure call works. 
Figure 5 shows a remote procedure call as a function of time. 

In a local call, the called procedure belongs to another 
module of the same task. Communication takes place di¬ 
rectly through procedure call mechanisms the compiler pro¬ 
vides. This is the view of communication the programmer 
ideally should have. 

In a remote call, the server belongs to a different task of 
the same or a different processor. For the client, however, 
everything should happen as if the call were local. In reality, 
the call is forwarded over the communication path. 
Bank.GetItem is executed by another task in the same computer 
(internal call) or in another computer (external call), 
which can be a different type of machine or can be running 
a different operating system. The results are returned in an¬ 
other message, as shown in Figure 6. 

Remote procedure calls are very similar to local procedure 
calls. For instance, remote calls may be nested at will. But 
there are some differences: 

• Local variables. Although most languages permit a mod¬ 
ule to export local variables, this is not allowed if the 
module is to be distributed; client and server residing on 
different sites share no common address space. There¬ 
fore, no reference to the address space of either the 
client or the server (common variables) may be used. 
This rule is in accordance with object-oriented program¬ 
ming. All information is transmitted in the parameters of 
the called procedure. In remote calls, parameters are 
always transmitted by value (even if the programmer 
passes parameters by reference). 

• Parallelism. In contrast to a local procedure call, the 
same service may be called again by another client while 
a remote procedure call is executing. For each client, a 
different instance of the service is started. Each service is 
executed by a different thread. When the threads are 
exhausted, the calls are queued. The threads need synchronization 
among themselves to access common 
resources. 

Figure 5. Remote procedure call. 

• Blocking calls. Although client and server may be ex¬ 
ecuted by independent processors, they do not execute 
in parallel; the client remains blocked until the call re¬ 
turns, according to the semantics of a procedure call. A 
client may wait for the result of only one call at a time. 
This causes no loss of parallelism, however. The proces¬ 
sor that was executing a blocked client is free to execute 
another service, the same service again (on behalf of 
another client), or another client until the call returns. 
When a call is blocked, another activity can start or re¬ 
sume execution. 

• Nonblocking calls. In addition to blocking calls, a client 
may start multiple calls, provided these calls do not re¬ 
turn results. Such calls are called remote procedure in¬ 
vocations. Unlike RPCs, RPIs require an underlying flow 
control mechanism, invisible to the user. A client may 
not wait for the completion of an RPI. Unlike Delta-4, 
Alphorn supports no asynchronous call with results (Request 
Service Request/Remote Service Wait), because 
this construct opens a loophole in the structure. (In ANSA, 
blocking and nonblocking calls are called synchronous and 
asynchronous.) 

• Events. Normally, a remote call is a one-to-one relation. 
Alphorn has no broadcast calls but can publish an event 
from a signaler to subscribed handlers. Programmers 
insisted on having this construct, but it does not support 
fault tolerance, and we therefore do not encourage its 
use. 

Stubs 

The entity in charge of relaying the call is called a stub (see 
Figure 6). The client stub plays the role of a server for the 
client. The client accesses the client stub in the same way it 
would access a local service procedure. The client stub con¬ 
verts the call into a message, which is sent over the network 
or over shared memory to the destination site. There, it is 
received by the server stub, which plays the role of a client 
for the server on that site. The server executes the service 
and returns to the server stub, which forwards the result to 


the client stub. The client stub then returns the results in the 
same way a local service procedure would do. 
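
The relay can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not Alphorn's implementation: JSON stands in for the message format, a direct function call stands in for the network or shared-memory path, and the class names and the salary value are hypothetical.

```python
import json


class BankServer:
    """Hypothetical server module offering one service procedure."""

    def __init__(self):
        self._salaries = {"SMITH": 4200}   # made-up data for the example

    def get_item(self, name):
        return self._salaries[name]


class ServerStub:
    """Plays the role of a client for the server on the destination site."""

    def __init__(self, server):
        self._server = server

    def handle(self, request_bytes):
        request = json.loads(request_bytes.decode("utf-8"))
        procedure = getattr(self._server, request["procedure"])
        result = procedure(*request["args"])          # ordinary local call
        return json.dumps({"result": result}).encode("utf-8")


class BankClientStub:
    """Plays the role of the server for the client; same interface as BankServer."""

    def __init__(self, transport):
        self._transport = transport                   # callable: bytes -> bytes

    def get_item(self, name):
        request = json.dumps({"procedure": "get_item", "args": [name]})
        reply = self._transport(request.encode("utf-8"))   # client blocks here
        return json.loads(reply.decode("utf-8"))["result"]


server_stub = ServerStub(BankServer())
bank = BankClientStub(server_stub.handle)   # the transport is a plain call here
salary = bank.get_item("SMITH")
```

The client code calls `bank.get_item` exactly as it would call the server directly; only the object behind the name differs. Because the arguments travel inside a message, they are necessarily passed by value.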

Some languages, such as Argus, 9 provide remote proce¬ 
dure calls as programming constructs. Most other languages 
cannot express that a procedure is called remotely or declare 
a module as server. Nor is this necessary, since we want to 
handle local calls and remote calls identically. To preserve 
the same interface for remote procedure calls as for local 
calls, we use stubs, not only for communication, but also as 
interfacing tools. The client stub presents the same interface 
to the client as the actual server would. In fact, the client stub 
simulates the existence of a local server. The server stub simu¬ 
lates a local client call to the actual server. The call interface 
is the same as it would be if the server were called locally. 
This is illustrated in Figure 6. 

Stubs do much more than just transform a local call into a 
remote call. In fact, they are the centerpiece of the communi¬ 
cation process. Stubs initialize and administer communica¬ 
tion, control the success of calls, publish services over the 
network, and provide the hooks for debugging tools. In fault- 
tolerant systems, stubs control the correct replication and syn¬ 
chronization of redundant processors. 10 In heterogeneous 
systems, stubs control the transformation of data formats. 11,12 

Automatic stub generation. The client and server stubs 
are normal modules, linked to the client and server respec¬ 
tively and forming part of their task. They could be written 
by hand. However, it is much more convenient and safer to 
generate them automatically with a stub generator. 13 The 
RPCGEN stub generator analyzes the file that defines the server 
module. This server definition file is, of course, common to 
the client and server modules. RPCGEN generates the imple¬ 
mentation modules of both the client and server stubs. The 
principle of stub generation is shown in Figure 7. 

Each client receives a copy of the client 
stub. The stub generator also examines the 
symbol file generated by the compiler to 
detect version conflicts. The symbol file contains 
a version key, which the stub transmits 
with each message to enforce type checking 
over the network. 
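
A version check of this kind might look like the following sketch. The message layout, the exception name, and the key values are hypothetical; the source states only that the key is transmitted with each message.

```python
class VersionMismatch(Exception):
    """Raised when client and server stubs were built from different interfaces."""


def make_message(version_key, payload):
    """Client side: attach the interface version key to every message."""
    return {"version_key": version_key, "payload": payload}


def check_message(message, expected_key):
    """Server side: refuse messages built against another interface version."""
    if message["version_key"] != expected_key:
        raise VersionMismatch(
            "expected key %r, got %r" % (expected_key, message["version_key"]))
    return message["payload"]
```

A mismatch is thus detected when the first message arrives, rather than showing up later as corrupted parameters.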

Service interface language 

All RPC systems rely on a server defini¬ 
tion file. It is called the network interface 
definition file in Apollo’s NCS and the I/F 
specification in Delta-4. 

Figure 7. Stub generation. 

The server definition file could be written 
in a standard computer language, like a definition 
file in Modula-2. Although simple applications 
could run with some default 
settings, this solution is not general enough. 
The stub generator needs additional information 
that is not relevant when a procedure is called directly 
from within the same task. For instance: 

• Procedure type. Remote procedure calls (blocking) and 
remote procedure invocations (nonblocking) are distin¬ 
guished by an “RPC” or “RPI” prefix. 

• Procedure parameters. Since clients and servers do not 
share a common address space, the stubs copy all param¬ 
eters, even those passed by reference (pointer). To avoid 
unnecessary copying, we distinguish parameters that the 
procedure may only read (input parameters, or “argu¬ 
ments”), parameters the procedure may modify (output 
parameters, or “results”), and parameters the procedure 
both reads and writes (input/output parameters). Ada is 
one of the few languages that tags the parameters as IN, 
OUT, or INOUT. We use the same convention in Alphorn. 

• Dynamic-size parameters. The size of this type of parameter 
is not part of the interface, since it is known 
only at runtime. A dynamic array declaration allows us 
to copy these parameters correctly. 

• Parallelism. The degree of parallelism (that is, the num¬ 
ber of server threads) and the degree of fault tolerance 
(the number of replicates) can be indicated in the server 
definition file. This data can be modified at configura¬ 
tion time. 

• Presentation. A description of the target machine is in¬ 
cluded in the interface definition so that the stubs can 
convert types automatically. Since the data type is part 
of the server interface, the client must adapt. For in¬ 
stance, a client stub converts 32-bit integers returned by 
a little-endian (Intel) server to its own big-endian (Mo¬ 
torola) format. Since the parties agree on the data repre¬ 
sentation at compile time, there is no need for an on-line 
negotiation of data types. This avoids a presentation layer 
in the communication process. A unified data type pre¬ 
sentation can also be specified, similar to Sun’s XDR. 5 

• Initialization. Some clients come into existence when 
the system is started. These “top actions” are specified 
as services called from nowhere. 
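
The presentation item above can be made concrete with a small sketch of the byte-order conversion a client stub would perform. It uses Python's standard `struct` module; the function names are hypothetical, and a real stub would derive the conversion from the interface definition instead of hard-coding it.

```python
import struct


def server_encode(values):
    """Pack 32-bit signed integers the way a little-endian server would send them."""
    return struct.pack("<%di" % len(values), *values)


def client_decode(data):
    """Client stub: interpret the received bytes as little-endian 32-bit integers."""
    count = len(data) // 4
    return list(struct.unpack("<%di" % count, data))


def to_big_endian(data):
    """Re-encode little-endian 32-bit integers in big-endian byte order."""
    values = client_decode(data)
    return struct.pack(">%di" % len(values), *values)
```

Because both stubs are generated from the same interface definition, the client knows at compile time which byte order to expect, and no negotiation is needed at run time.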

The server interface is expressed in a service interface lan¬ 
guage. ECMA, ROSE (Remote Operations Service Elements, 
ISO 9072), and Delta-4 use the standardized Abstract Syntax 
Notation 1 (ASN1, ISO 8824) to achieve independence of any 
computer language. However, ANSA and IEC SC21 consider 
ASN1 suitable for expressing message structures but too weak 
as an interface description language; they have developed 
their own interface description language, derived from Xerox’s 
Courier protocol. 14 

For its service interface language, Alphorn uses the syntax 
of a normal Modula-2 definition file, in which the stub control 
instructions are enclosed in comments. This allows features 
not found in ASN1. The main advantage is that the 
Modula-2 compiler can process the file directly. This is all 
that is needed to run within a single task. To make a module 
remotely accessible, the stub generator analyzes the stub con¬ 
trol information enclosed in the comments and the symbol 
file (to retrieve the version key). 

The use of Modula-2 has an interesting by-product: Modula-2, 
like most languages, was created as a sequential language 
and not to support parallelism. As such, it defines a static 
structure. Clients and servers, however, are dynamic entities. 
The Modula-2 syntax is used here to express a dynamic struc¬ 
ture, which becomes apparent only at runtime. This interpre¬ 
tation is done by the stub generator. 

Configuration 

Once the servers, the top client modules, and the server 
stubs have been compiled, they are configured; that is, the 
entities belonging to the same site are linked together to form 
one runnable task. There may be several tasks in the same 
node, each one comprising servers and top modules. Server 
modules are not active by themselves; they need a client to 
call them. Top actions become active when the task is started. 
When they call a service, they become top clients. A task is 
formed by a main Modula program that directly imports the 
top modules and the server stub modules (Figure 8). The top 
modules import client stubs as needed, while the server stub 
modules import their corresponding server modules. 

The main program provides the Runtime Support. (RTS is 
similar to Deltase in the Delta-4 project and to the ANSA 
nucleus.) It imports both the top clients and the server stubs. 
The main program creates and manages a number of threads. 
Conceptually, all these threads are parallel—that is, they would 
execute in parallel if there were a sufficient number of pro¬ 
cessors. They form a pool that runs the servers and top ac¬ 
tions. The runtime support provides the communication 
interface. 

The writing of the main program is automated by the 
OBJGEN utility. The OBJGEN configurator generates a main 
program, which is a normal Modula-2 program, to hold the 
top client, client stubs, server stubs, and server modules. 
OBJGEN interprets the definition files as follows: 

• For each parameterless procedure exported by a
top module, a thread is created and started. These threads 
form the top actions. Conceptually, all top actions are 
started simultaneously. A top action becomes a top cli¬ 
ent when making a remote call. 

• For each server stub, several server threads, or servers 
for short, are created. When an action calls a service, 
one of these servers will execute the corresponding pro¬ 
cedure as a server action. The number of server threads 
can be specified in the definition file during stub gen¬ 
eration. The default value is 3 in the present implemen¬ 
tation. 
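
The two rules above can be sketched in a few lines of Python. This is only an illustration of the thread plan a configurator might derive, not the OBJGEN implementation; the names `TopModule`, `ServerStub`, and `plan_threads`, and the example module names, are hypothetical. Only the default of three server threads comes from the text.

```python
from dataclasses import dataclass

DEFAULT_SERVER_THREADS = 3  # default value stated in the text


@dataclass
class TopModule:
    name: str
    procedures: list  # parameterless procedures exported by the module


@dataclass
class ServerStub:
    name: str
    threads: int = DEFAULT_SERVER_THREADS


def plan_threads(top_modules, server_stubs):
    """Return one (kind, owner, label) tuple per thread to be created."""
    plan = []
    # Rule 1: one thread per exported parameterless procedure (a top action).
    for module in top_modules:
        for proc in module.procedures:
            plan.append(("top_action", module.name, proc))
    # Rule 2: a pool of server threads per server stub.
    for stub in server_stubs:
        for i in range(stub.threads):
            plan.append(("server", stub.name, f"worker{i}"))
    return plan


plan = plan_threads(
    [TopModule("Display", ["Refresh", "Poll"])],
    [ServerStub("Logger"), ServerStub("Archive", threads=1)],
)
```

Here the plan holds two top actions, three threads for the hypothetical `Logger` stub (the default), and one for `Archive`.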

Runtime and debugging tools

A running system consists of a 
number of nodes, each running sev¬ 
eral application and system tasks. 

Network communication is only par¬ 
tially realized in the stubs. Other enti¬ 
ties must exist to support them, in 
particular the networker. The net¬ 
work configuration is shown in Fig¬ 
ure 9 on the next page. 

Each task may contain servers and 
top actions. Top actions usually call 
services of other servers, but they 
need not. For instance, in Figure 9,
Task 2, the top action of node Alpha,
needs no access to a stub since it per¬ 
forms no remote calls. 

Networker. A stub performs only 
part of the communication work. Ba¬ 
sically, it provides the glue between 
the application interfaces and the
standard low-level RPC mechanism. A driver called the
networker executes the actual communication. The networker 
is permanently active; that is, it is a separate task or a device 
driver (except on single-task systems like MS-DOS, where it is 
linked to the application). This driver is common to all applica¬ 
tions in a node, and it hides the details of communication from 
the stubs. 

The networker executes the RPC protocols. It analyzes the 
available communication links—for instance, the shared- 
memory communication module, the Ethernet driver, DECnet 
to access another computer, or X.25. When the networker rec¬ 
ognizes that client and server are in the same node, it connects 
the stubs directly through shared memory. Similarly, when a 
DECnet link exists between client and server, the networker 
uses its facilities directly. 
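
The link selection described above might look roughly like the following Python sketch. It is an assumption-laden illustration, not the networker's code: the function `choose_link`, the `available` table, and the link names are hypothetical.

```python
def choose_link(client_node, server_node, available):
    """Pick a communication link for one client/server pair.

    `available` maps an unordered node pair (a frozenset) to the list of
    link types found between the two nodes, in order of preference.
    """
    # Same node: connect the stubs directly through shared memory.
    if client_node == server_node:
        return "shared-memory"
    links = available.get(frozenset((client_node, server_node)), [])
    if not links:
        raise LookupError(f"no link between {client_node} and {server_node}")
    # Otherwise use the first link the analysis found, e.g. DECnet.
    return links[0]


available = {
    frozenset(("alpha", "beta")): ["decnet", "x25"],
    frozenset(("alpha", "gamma")): ["ethernet"],
}
local = choose_link("alpha", "alpha", available)
remote = choose_link("beta", "alpha", available)
```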

Name server. When a client calls a server in the network, its 
networker must know the location of this server. That could be 
specified during configuration by early binding. Alphorn
includes and removes separately compiled modules as the
system runs. It uses late binding to indicate the location of the
servers at runtime. Each networker performs the name server
function—there is no central trader. The name server publishes 
all services in the network. (It does not publish their interfaces, 
however, unlike the Delta-4 trader. That would have required 
each node to have mass storage.) A multicast protocol continu¬ 
ally actualizes all name servers. 

The same service may be installed several times in the sys¬ 
tem, to increase performance or fault tolerance. In the first case, 
the user may wish to choose and distinguish which server is 
processing the call. This is especially important for exception 
handling. Thus, the name server provides a mechanism to
identify the client and the server that participate in a call. When
a service is replicated for fault tolerance, both the primary and
the backup carry the same identification. The networker monitors
the start and termination of servers and clients, notifying all
interested applications.

Figure 8. Configuring a task with top and server modules.
(Diagram labels: Top modules; Server modules.)

NetMonitor. During development and test, it is useful to
see how communication takes place and
which entities are active at a given time. The NetMonitor utility runs
as a task anywhere in the network. It monitors existing sites, 
traces established connections, lists clients, servers, handlers, 
and signalers, and displays statistical information. Finally, 
NetMonitor can stop sites in an orderly way. 

Fault tolerance support 

There is no general-purpose fault-tolerant computer. To 
handle redundancy effectively, a fault-tolerant system must 
take account of the plant or other process control application. 
Fault tolerance therefore requires close collaboration between 
the application programmer and the system designer. Alphorn
provides the following basic tools to help them construct
fault-tolerant systems:

Exception handling. Unlike a local call, a remote call can
fail, and the caller must be prepared for that. To maintain
the same interface as for local calls, a call’s status is delivered
by a “hidden” module called RPC Status. Since the client is
blocked during the call, a failure of the server must be signaled
by an exception handler. This is a critical component, since
some machines and languages, such as VAX/VMS, support exception
handling well, while others seem to ignore the problem. Much of
the porting work for Alphorn consisted of implementing exception
handlers for machines that lacked them.

Figure 9. Network at runtime. (Node labels: Alpha, Beta, Gamma.)


Error reporting. Each node constantly monitors the net¬ 
work. Healthy nodes regularly communicate or send “I’m 
alive” messages; the absence of messages triggers reconfig¬ 
uration. Therefore, the worst-case error detection latency is 
equal to the network’s cycle time. This mechanism is part of 
the basic runtime support. 
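
A minimal sketch of this detection rule, in Python: a node is suspected once nothing has been heard from it for longer than one network cycle. The function name `silent_nodes` and the timestamps are hypothetical, chosen only to illustrate the idea.

```python
def silent_nodes(last_heard, now, cycle_time):
    """Return the nodes from which no message arrived within one cycle.

    `last_heard` maps a node name to the time (in seconds) of its most
    recent message, whether ordinary traffic or an "I'm alive" message.
    """
    return sorted(node for node, t in last_heard.items() if now - t > cycle_time)


last_heard = {"alpha": 9.5, "beta": 8.0, "gamma": 9.75}
suspects = silent_nodes(last_heard, now=10.0, cycle_time=1.0)
```

With these made-up timestamps only `beta` has been silent for more than one cycle, so it alone would trigger reconfiguration.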


Redundant distributed system. In process control at the 
supervisory level, the main requirement is high availability. 
Basic availability is best provided by a duplex structure. Func¬ 
tional redundancy is provided by additional nodes in the 
network and duplication of communication links. When a 
node fails, its backup node takes over its function. The backup
node is functionally redundant to the on-line node. In particular,
the backup has the same access to I/O devices as the
on-line node, either through dual-ported devices or redundant
devices.

Figure 10. Resilient remote procedure calls.

Not all nodes need to be duplicated. The Alphorn fault 
tolerance concept supports both replicated and nonreplicated 
nodes in the same system. Higher levels of replication such 
as triplication (two backups) may also be integrated. Most 
fault-tolerant applications in Alphorn, however, are expected 
to be of the duplex type. 

Error detection. The failure of a node is detected by means 
such as parity checks, self-checking circuits, watchdog tim¬ 
ers, and the like. We assume that each node is fail-stop—it 
will stop sending data in case of failure (“fail-silent” in Delta- 
4). For process control at the supervisory level, integrity is 
not critical and the self-checking provided by commercial 
mainframes is generally sufficient. Full integrity—no false data 
in case of failure—is mandatory in critical applications such 
as railway signaling. In such cases, each node must be dupli¬ 
cated to provide sufficient coverage. 

It makes no sense to increase fault detection coverage of 
hardware past a certain point. Software causes many errors, 
which can hardly be caught by hardware means. That is why 
Alphorn provides no protection against malicious behavior 
or babbling nodes. We do not try to provide tolerance of 
software errors. However, it is important to report and treat 
software errors, even in nonredundant cases, so that the fail¬ 
ure of one unit will not cause others to crash. For this pur¬ 
pose, Alphorn relies on its exception-handling mechanism. 

Redundancy actualization. It is not enough to add re¬ 
dundant nodes to a network. Unless the state of backup nodes 
is close enough to that of their on-line units, switchover will 
be rough or even impossible. There are two basic techniques
to keep backup computers actualized: standby, or periodical
update, and workby, or parallel operation (called passive and
active replicates in Delta-4; coordinated and parallel replica
in ANSA). 15

• In the standby (or asynchronous) mode, 
a backup node is regularly actualized 
by transfers from the on-line node. 
These state transfers take place at check¬ 
points. Otherwise, the backup is free to 
perform other tasks. 

• In the workby (or synchronous) mode,
the on-line node and the backup node
perform the same tasks at the same time.
Therefore, their internal states remain
identical. In principle, their outputs
could be compared to detect errors. The
backup cannot be used for other tasks.
Workby requires synchronization and
matching to remove all sources of nondeterminism, which
could let the units diverge. Synchronization and matching
over the network cost time. 16


Alphorn supports both operation modes, but unlike ANSA 
and Delta-4, Alphorn supports no replica groups. In either 
mode, the backup must constantly monitor the network 
activity to reproduce or coexecute the actions of the primary.
In both cases, a teaching mechanism updates a fresh
node to backup status. 

Resilient remote procedure calls. Standard RPCs can 
be easily extended to support both the standby and the workby 
modes of operation. 17 The basic idea is that actualization of 
the backup, whether through checkpointing or synchroniza¬ 
tion, requires that communication over the network be moni¬ 
tored. To this effect, both the on-line nodes and their backups 
receive call and return messages (Figure 10). 

For instance, when a call is made, it is received by the 
server and by the backup server. If the server fails, the backup 
will have received the service call and can execute it. Or, if 
the client fails, its backup can receive the return message in 
its place and proceed. Thus, each RPC becomes an implicit 
checkpoint. This method effectively eliminates the domino 
effect. 18 
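
The server-side half of this scheme can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Alphorn's protocol: the `Replica` class, `resilient_call`, and the in-memory "network" are hypothetical, and failure is modeled by a simple `alive` flag.

```python
class Replica:
    """A server replica that records every call message it sees."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.pending = {}  # call_id -> (procedure, args) seen but not finished

    def observe_call(self, call_id, procedure, args):
        if self.alive:
            self.pending[call_id] = (procedure, args)

    def observe_return(self, call_id):
        # The return message tells the backup that the call is complete.
        self.pending.pop(call_id, None)

    def execute(self, call_id):
        procedure, args = self.pending.pop(call_id)
        return procedure(*args)


def resilient_call(call_id, procedure, args, primary, backup):
    # Both the on-line server and its backup receive the call message.
    primary.observe_call(call_id, procedure, args)
    backup.observe_call(call_id, procedure, args)
    if primary.alive:
        result = primary.execute(call_id)
        backup.observe_return(call_id)
        return result, primary.name
    # The primary failed: the backup already holds the call and executes it.
    return backup.execute(call_id), backup.name


def add(a, b):
    return a + b


primary, backup = Replica("primary"), Replica("backup")
normal = resilient_call(1, add, (2, 3), primary, backup)
primary.alive = False
failover = resilient_call(2, add, (2, 3), primary, backup)
```

The client-side case (a backup client receiving the return message) would mirror this structure.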

Causal broadcast. Actualization and synchronization of 
replicates over the network require reliable multicast com¬ 
munication. (Although some local networks provide acknowl¬ 
edged broadcast transmission, this property cannot be 
expected of open networks.) Rather than a full-fledged atomic 
broadcast protocol, Alphorn relies on a causal broadcast pro¬ 
tocol. 10 A causal broadcast protocol ensures that message order 
is maintained under all circumstances for all interested par¬ 
ties. This protocol is built on top of existing protocols such as 
DECnet or X.25. Causal broadcast has proved to be very efficient
and handy, even in nonreplicated applications. (Unlike
Delta-4, this protocol does not provide authentication or sig¬ 
natures; protection against intrusion is unnecessary in a dedi¬ 
cated system.) 
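
The article does not describe how Alphorn's causal broadcast protocol is built. As a generic illustration of causal ordering only, the Python sketch below uses vector timestamps, a standard textbook technique substituted here for the unspecified mechanism; the class `CausalReceiver` and the message names are hypothetical.

```python
class CausalReceiver:
    """Delivers broadcast messages in causal order using vector timestamps."""

    def __init__(self, n_processes):
        self.delivered = [0] * n_processes  # messages delivered per sender
        self.held = []                      # messages waiting for predecessors
        self.log = []                       # payloads in delivery order

    def _deliverable(self, sender, stamp):
        # Must be the next message from this sender ...
        if stamp[sender] != self.delivered[sender] + 1:
            return False
        # ... and everything the sender had seen must already be delivered.
        return all(
            stamp[k] <= self.delivered[k]
            for k in range(len(stamp))
            if k != sender
        )

    def receive(self, sender, stamp, payload):
        self.held.append((sender, stamp, payload))
        progress = True
        while progress:
            progress = False
            for message in list(self.held):
                s, st, p = message
                if self._deliverable(s, st):
                    self.held.remove(message)
                    self.delivered[s] += 1
                    self.log.append(p)
                    progress = True
        return list(self.log)


receiver = CausalReceiver(3)
# Process 1 sent m2 after seeing m1 from process 0, but m2 arrives first.
after_m2 = receiver.receive(1, [1, 1, 0], "m2")
after_m1 = receiver.receive(0, [1, 0, 0], "m1")
```

The receiver holds `m2` until `m1`, which causally precedes it, has been delivered.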

Application dependence. Ideally, the application pro¬ 
grammer writes programs without caring about replication. It 
is indeed desirable to offer, for instance, a control system in 
a redundant and in a nonredundant configuration. There¬ 
fore, the code running in the on-line unit and in the backup 
unit must be identical. Redundancy is introduced at configu¬ 
ration time and hidden in the runtime system or in the device 
drivers. This ideal cannot be completely maintained. 

Fault tolerance constructs. Alphorn provides useful con¬ 
structs to support fault tolerance. While each RPC provides a 
synchronization/rollback point, it may be necessary to include
checkpoints at closer intervals, especially for top actions,
which are never called as servers. To this effect, Alphorn
provides an iteration construct, which inserts a checkpoint 
each time a loop is executed. 

In standby mode, data modified in the primary node are 
transmitted to the backup unit. Indiscriminate checkpointing, 
as used in transaction computers, is not acceptable in real¬ 
time systems. Alphorn provides a construct for tagging data 
structures that are subject to modification. Only these data 
will be transmitted to the standby when the call returns. 
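
The tagging idea can be pictured with the following Python sketch, which assumes the state is a simple dictionary and the tag set is declared by the programmer. The functions `checkpoint` and `apply_checkpoint` and the variable names are hypothetical, not Alphorn constructs.

```python
def checkpoint(state, tagged):
    """Collect only the tagged data for transmission to the standby."""
    return {name: state[name] for name in sorted(tagged)}


def apply_checkpoint(backup_state, delta):
    """Bring the standby up to date with the transmitted data."""
    backup_state.update(delta)


primary_state = {"setpoint": 10, "mode": "auto", "lookup_table": list(range(1000))}
backup_state = dict(primary_state)
tagged = {"setpoint", "mode"}  # declared as subject to modification

primary_state["setpoint"] = 12  # modified during the call
delta = checkpoint(primary_state, tagged)
apply_checkpoint(backup_state, delta)
```

The large, untagged `lookup_table` is never sent, which is the point of avoiding indiscriminate checkpointing.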

This checkpointing allows us to build atomic actions, which 
are either completely executed or not at all. When a failure 
occurs, computation starts from the last checkpoint (rollback) 
and proceeds to repeat operations already done by the failed 
unit (rollahead), until reaching the failure point. The rollahead 
procedure avoids repeating output operations already per¬ 
formed by the failed unit. But in some situations, it may be 
necessary to redo such operations—for example, refreshing 
the operator’s screen. Here, the application dependency must 
be hidden in the I/O drivers. Alphorn does not, however, 
provide atomic transactions; this is considered an application 
issue. 

Teaching backups. The teaching of new backups (for
instance, after repair) is performed in the background and
consists of a dialogue-like communication between an on-line
task and the future standby task, to transmit the static
state of an object. Depending on the amount of spare computing
time, two options are available. Concurrent teaching
goes on during normal operation but increases load. Off-line
teaching requires a small interruption but costs no runtime
overhead.


The original goal of the Alphorn project was
to develop a fault-tolerant computer network. Later, the software
environment and the network-independent communication
based on RPC became the mainstream of the project.
Although based on Modula-2, Alphorn is not limited to that
language. Stubs may be easily generated for other languages,
such as C. In fact, the target language, machine, and compiler
are options of the stub generator. A Pascal version has
been developed for a machine that has no Modula-2 compiler.
In collaboration with the Swiss Federal Institute of Technology
in Lausanne, we are working to provide manufacturing
message services (ISO 9506) as Alphorn services. Alphorn is
now used for a distributed database implemented on behalf
of the Swiss PTT (post office), consisting of 200 nodes scattered
over Switzerland. That is in keeping with the project’s
name: alphorns are traditional Swiss musical instruments derived
from the long horns shepherds once used to call their
herds across the valleys.
cluding the digital-to-analog converters for red, green, blue,
and sound. (See Figure 4.)

Figure 4. VL86C310 VIDC block diagram.

The VIDC supports pixel rates from 8 to 24 MHz, data 
serializing from 1 to 8 bits per pixel, fully programmable 
screen parameters, and flexible cursor sprites. 


I/O controller 

The VLSI VL86C410 I/O controller (IOC) shown in Figure 
5 provides a unified environment for I/O related activities 
such as interrupt and peripheral controllers. 

The controller contains elements such as timers, baud rate
generators, keyboard controller, and parallel I/O port control.
Again, the IOC was designed as part of a system to maximize
the performance/cost ratio. Only part of the data and address
buses connect to the IOC. This small pin count helps in ASIC
applications by reducing busing on the ASIC chip.

Figure 5. VL86C410 IOC block diagram.

Performance 

Let’s look at the VL86C010 performance rate while keeping 
in mind one caveat: Benchmarks should always be looked at 
with caution, even if you have created them yourself. Unless 
the benchmark is the actual debugged application code, it can 
be inapplicable or can overlook some operational detail that 
arises when using the code in a particular application. 

A published benchmark 2 for the Intel 376 family for bit-
block graphics shows a final drawing rate of 17.067 million
pixels per second at a processor clock rate of 16 MHz (see
Figure 6a). The VL86C010 performs the
same function (using different instructions, of course) with a
processor clock of 12 MHz and achieves a drawing rate of
30.77 million pixels per second (see Figure 6b).
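
As a rough cross-check of these rates, the Python sketch below divides the clock rate by the loop cycle counts listed in Figure 6 (30 cycles and 13 cycles per 32-bit word). It assumes one pixel per bit, so 32 pixels per word; that assumption and the function name `drawing_rate_mpixels` are mine, not the article's.

```python
def drawing_rate_mpixels(clock_mhz, cycles_per_word, pixels_per_word=32):
    """Million pixels per second for a loop that moves one word per pass."""
    return clock_mhz * pixels_per_word / cycles_per_word


intel_376 = drawing_rate_mpixels(16, 30)      # about 17.067
vl86c010 = drawing_rate_mpixels(12, 13)       # about 29.54
vl86c010_12_5 = drawing_rate_mpixels(12.5, 13)  # about 30.77
```

Under this assumption the Intel figure is reproduced exactly. The 13-cycle loop gives about 29.5 Mpixels/s at 12 MHz, and the quoted 30.77 figure would correspond to a 12.5-MHz clock, so the article's inputs may differ slightly from the ones assumed here.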

Another example (shown in Figure 7) involves an MC68020
coded to solve z = 2x + y, where x and y are variables that
could not be destroyed. The example was coded two different
ways: with the variables in registers (already loaded), as
seen in Figures 7a and 7b, and with the variables in a stack
(stack pointer initialized), as seen in Figures 7c and 7d. The
first example took the Motorola processor eight clock cycles
and the VL86C010 one clock cycle. The second example took
the 020 processor 18 clock cycles (Figure 7c) and the VL86C010
nine clock cycles (Figure 7d). A final example (see Figure 8)
demonstrates the effectiveness of our RISC’s architecture and
conditional execution compared to that of the MC68020.
Intel 80376 example code (64-bit barrel shifter and 16-bit data bus)

         MOV   ESI, Source_address
         MOV   EDI, Destination_address
         MOV   EBX, Word_count
         MOV   CL, Relative_offset
         MOV   EDX, [ESI]
         ADD   ESI, 4
                                      Clock     Length
                                      cycles    (bytes)
BltLoop: LODSD                          7          1
         SHRD  EDX, EAX, CL             3          3
         XCHG  EDX, EAX                 3          1
         STOSD                          6          1
         DEC   EBX                      2          1
         JNZ   BltLoop                  9          2
         Totals                        30          9

(a)

;VL86C010 register usage
;r1 - Source_address
;r2 - Destination_address   r3 - dd1 data
;r4 - dd2 data              r5 - Relative_offset
;r6 - 32 - Relative_offset, r7 - Word_count

         LDR   r1, Source_address
         LDR   r2, Destination_address
         LDR   r5, Relative_offset
         LDR   r4, [r1]+
         RSB   r6, r5, 32
                                      Clock     Length
                                      cycles    (bytes)
BltLoop: MOV   r3, r4 LSR r5            2          4
         LDR   r4, [r1]+                3          4
         ORR   r3, r3, r4 LSL r6        2          4
         STR   r3, [r2]+                2          4
         SUB   r7, 1                    1          4
         BNE   BltLoop                  3          4
         Totals                        13         24

(b)

(Diagram labels: Word count = length in words (32 bits); bit offset
value; Destination_address; registers EAX and EDX, bits 31 to 0, for
the 80376; registers r3 and r4, bits 31 to 0, for the VL86C010.)

Figure 6. Bitblt example in graphics application: Intel 376 family at 16-MHz drawing rate of 17.067 Mpixels/second (a) and
the VLSI Technology VL86C010 processor at the 12-MHz drawing rate of 30.77 Mpixels/second (b).
Calculate z = 2x + y where x and y are register variables that
cannot be destroyed.

         move.l  rx, rz                  ; (2) rz=x
         lsl.l   #1, rz                  ; (4) rz=2x
         add.l   ry, rz                  ; (2) rz=2x+y

Requires three 16-bit instructions and eight clock cycles (cache
case).

(a)

         add     rz, ry, rx, lsl #1      ; (1) rz=2x+y

Requires one 32-bit instruction and one clock cycle.

(b)

Using x and y as stack variables (stack pointer initialized),

         move    offx(a6), rz            ; (7) z=x
         lsl.l   #1, rz                  ; (4) z=2x
         add.l   offy(a6), rz            ; (7) z=2x+y

Requires two 32-bit, one 16-bit instruction, and 18 clock cycles.

(c)

         ldr     rtmp1, offx(r13)        ; (4) tmp1=x
         ldr     rtmp2, offy(r13)        ; (4) tmp2=y
         add     rz, rtmp2, rtmp1, lsl #1 ; (1) z=2x+y

Requires three 32-bit instructions and nine clock cycles.

(d)

Figure 7. Code examples: with the variables in registers for
the MC68020 (a), for the VL86C010 (b), with the variables
in a stack for the MC68020 (c), and the same for the
VL86C010 (d).


A product growth path is essential to ensure continued 
speed and applicability. The performance figures generally 
given for most RISC processors are based on the use of static 
RAMs with state-of-the-art access times. For these processors 
to double their performance, the RAMs must perform twice 
as fast, and this follows right on up the performance curve. I 
have quoted VL86C010 performances based on the use of 
80-ns DRAMs, which are slower than the SRAMs. 

The newest member of the family, the VL86C020, contains 
an on-chip cache and produces 2.5 times the performance 
using the same DRAMs. Additionally, if an application war¬ 
rants it, the processor can be run faster using static RAMs. 
These CPUs have a robust growth path within the existing 
memory technology. 


Logical Or function: If Rn = p OR Rm = q, GOTO LABEL

         cmp     Rn, #p                  ; (2) If Rn=p OR Rm=q then
         beq     LABEL                   ; (6) GOTO LABEL
         cmp     Rm, #q                  ; (2)
         beq     LABEL                   ; (6)

Requires two 16-bit instructions and two 48-bit instructions when
using long addressing (32-bit displacement) and 16 clock cycles
when cache-resident and the condition is met by the second case.

(a)

         cmp     Rn, #p                  ; (1)
         cmpne   Rm, #q                  ; (1) if Rn not equal p, try other test
         beq     LABEL                   ; (4)

Requires three 32-bit instructions and six clock cycles for the worst
case in either case.

(b)

Figure 8. Code examples: Logical Or function for the
MC68020 (a) and for the VL86C010 using conditional execution (b).

Virtually all RISC CPUs on the market today
have die sizes that are at the upper limit of that which is
economically producible. This die size rules out their use as
cores in the design of other ICs. The VL86C010 RISC CPU, on
the other hand, measures approximately 30 percent of the
size of other RISC CPUs. In addition, the busing scheme of
the entire family allows efficient designs in ASICs. In fact, our
designers implemented the new VL86C020 RISC CPU with
on-chip cache memory by placing the 010 core on a cell-
based chip and then adding the RAM and control logic to
form the cache.

Certainly, if one is designing an embedded controller today
that needs RISC performance, small size, and low power
and cost with increased reliability and security, there exists a
way to do it efficiently with ASICs.
continued from p. 27

lines can be cascaded to become six stages (Figure 3). The 
i860 defines 32 different dual-operation opcodes, allowing 
add-before-multiply, multiply-before-add, and other combi¬ 
nations. Three 64-bit registers, called KR (constant real), KI 
(constant imaginary), and T (temporary), hold results be¬ 
tween the adder and multiplier. 

The other parallel feature is simultaneous execution of both 
an integer and floating-point instruction in dual-instruction 
mode (DIM). In this mode, explicitly controlled by the pro¬ 
grammer, the instruction cache fetches two instructions at 
once, one for the floating-point unit and the other for the 
integer unit (Figure 6). Designers created the dual-instruction
mode to keep the floating-point unit (and graphics) running
at full bandwidth, in spite of the need for instructions to
move data between memory and registers. With the pipelined
floating-point instructions producing one or two results every
clock cycle, inserting the necessary load/store/branch
instructions could halve the performance. Because the integer
unit performs all memory accesses for the floating-point
registers, it can accomplish data transfers without slowing
the floating-point units.

Figure 5. Floating-point adder and multiplier paths.
(Diagram labels: SRC 1, SRC 2, RDEST, Multiplier, Adder.)

Clock cycle    0    1    2    3    4    5    6
D.PFADD        F    D    E1   E2   E3
ADDU           F    D    E    W
D.PFSUB             F    D    E1   E2   E3
SUBU                F    D    E    W
D.PFMUL                  F    D    E1   E2   E3
BR                       F    D    E    W

Figure 6. Dual-instruction-mode pipeline with two instructions
each clock cycle.

Figure 7. Typical external-memory pipeline hardware.

Figure 8. Pipelining for cache line fill. The four transfers, R1, R2, R3, and R4,
each require two clock cycles at 40 MHz.

Dual-instruction mode is not needed 
with scalar-mode floating point, because 
two integer instructions can execute in 
the two clock cycles that occur between 
the initiation of consecutive floating¬ 
point scalar operations. 

Dual-instruction mode parallelism is 
similar to superscalar implementations 
such as the IBM RS/6000 and Intel i960 
processors. Superscalar hardware can 
determine at runtime opportunities for 
simultaneous execution of two or three 
instructions, without requiring explicit 
action by the programmer. However, 
the explicitly programmed dual-instruc¬ 
tion mode is easier to use than VLIW 
(very long instruction word) machines 
that cause compilers difficulty in find¬ 
ing enough useful instructions to pack into a long-word unit 
of execution. 

Programmers have put dual-instruction mode to effective 
use in the i860 software libraries for numerics and graphics. 
It doubles the performance of 3D rendering, fast Fourier 
transforms, matrix multiply operations, and other important 
kernels of code. Compilers will also exploit it, and future 
hardware can implement superscalar as well as DIM. 

External pipeline. Like the internal instruction execu¬ 
tion, the external bus of the i860 CPU exploits pipelining to 
increase bandwidth. Of course, the memory system can op¬ 
tionally ignore it for simplicity’s sake, but most existing sys¬ 
tems use this bus pipelining (Figure 7). They maintain the 
full bus bandwidth of 8 bytes transferred every two clock 
cycles, even though their access latency may rise to six cycles. 

The NA# (next address) input pin and the ADS# (address 
strobe) output pin control the external pipeline. ADS# indi¬ 
cates a valid address on the bus, and NA# indicates willing¬ 
ness by external memory to accept another request, even 
though the data for the previous cycle has not yet been 
accessed. The CPU can issue up to three requests for reads 
and writes before the first data transfer completes. The 
READY# input to the CPU signals valid data on the data pins 
for reads, and that the memory has accepted the data on 
writes. 


The on-chip caches have a line size of 32 bytes. With no 
external pipelining and a memory latency of six cycles, a 
cache line fill (four 8-byte fetches) takes 24 cycles to complete.
However, by using external bus pipelining, a cache
fill takes only 12 cycles with the same external memory latency,
which cuts the cache fill time by 50 percent.
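
The two cycle counts follow from simple arithmetic, sketched here in Python. The model assumes that, with pipelining, only the first transfer pays the full six-cycle latency and each later transfer completes two cycles after the previous one; the function name `line_fill_cycles` is hypothetical.

```python
def line_fill_cycles(transfers, latency, cycles_per_transfer, pipelined):
    """Clock cycles needed to fill one cache line."""
    if pipelined:
        # Later requests overlap the latency of the earlier ones.
        return latency + (transfers - 1) * cycles_per_transfer
    # Without pipelining, every transfer waits out the full latency.
    return transfers * latency


unpipelined = line_fill_cycles(4, latency=6, cycles_per_transfer=2, pipelined=False)
pipelined = line_fill_cycles(4, latency=6, cycles_per_transfer=2, pipelined=True)
```

This gives 24 cycles without pipelining and 6 + 3 x 2 = 12 cycles with it.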

External bus optimizations. The processor contains a 
comparator to drive an output pin called NENE# (next-near). 
This pin provides support for fast page mode and static col¬ 
umn DRAMs. Such DRAMs can access data in the same page 
(typically 1 Kbit) in half the time required by access to a 
different page. For example, an 80-ns static column DRAM 
requires only 40 ns for a “near” read. 

Actually, “next near” is a misnomer because the pin indi¬ 
cates previous near and is not capable of predicting the 
future. The NENE# pin indicates that the current memory 
access falls on the same DRAM page as the previous access. 
Consecutive accesses to the same page occur frequently. 
For example, cache line fills and write-backs (32 bytes) con¬ 
sist of three near accesses after the first access of the line. 
Figure 8 illustrates that RAS toggles only once for the four 
transfers of a cache line. To compensate for the inevitable 
increase in DRAM chip capacity, the page size used for NENE# 
derivation is software programmable in an on-chip control 
register. 
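
The comparison behind NENE# can be pictured with a short Python sketch. It assumes the page size is a power of two expressed in bytes; the function name `next_near`, the 2,048-byte page size, and the addresses are hypothetical values chosen for illustration.

```python
def next_near(previous_addr, current_addr, page_bytes):
    """True when the current access falls in the same DRAM page as the previous one."""
    mask = ~(page_bytes - 1)  # keeps only the page-number bits
    return (previous_addr & mask) == (current_addr & mask)


PAGE_BYTES = 2048  # programmable in the real part; value assumed here

# The four 8-byte transfers of one 32-byte cache line fill.
line = [0x1000, 0x1008, 0x1010, 0x1018]
near_flags = [next_near(a, b, PAGE_BYTES) for a, b in zip(line, line[1:])]
crosses_page = next_near(0x17F8, 0x1800, PAGE_BYTES)
```

The three accesses after the first one in the line are all near, while an access that crosses a page boundary is not.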
Figure 9. Internal caches with the widths of fields shown in bits.
(Diagram labels: 8-Kbyte data cache; Way 0, Way 1.)


Internal caches. The i860 processor uses split instruction 
and data caches (Figure 9) to ensure high bandwidth, as do 
the Mips R3000 and Motorola 88000 and 68040 processors. 
The 16-byte data path and 8-byte instruction bus at 40 MHz
yield 960 Mbytes/s of access. 
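
The 960-Mbytes/s figure can be reproduced by adding the two path widths and multiplying by the clock rate, assuming one access on each path per clock. The function name below is hypothetical.

```python
def cache_bandwidth_mbytes(data_path_bytes, instruction_bus_bytes, clock_mhz):
    """Combined on-chip cache bandwidth in Mbytes/s, one access per path per clock."""
    return (data_path_bytes + instruction_bus_bytes) * clock_mhz


bandwidth = cache_bandwidth_mbytes(16, 8, 40)
```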

The feasibility of implementation on one chip in 1988 drove 
the design decisions of a 12-Kbyte size, two-way associativ¬ 
ity, and replacement algorithm of the caches. Of course, the 
size and associativity are larger in the next-generation i860 
XP CPU. To avoid the complexity required for least recently 
used (LRU) replacement, designers employed a pseudoran¬ 
dom algorithm to choose which of the ways to replace upon 
a miss. The high bandwidth and low latency (due to their 
locations on chip) offset the small size and replacement, as 
performance measurements show. 

Virtual addressing. The virtually addressed caches re¬ 
quire attention by operating systems programmers but are 
transparent to applications. A cache of this type uses a virtual 
address, rather than a real address, to select a line of data. 
Two advantages arise from virtual addressing in the hard¬ 
ware design: 

• The cache access is very fast because the TLB (address
translation look-aside buffer) lookup is not needed prior
to the cache tag comparison. Translation to the physical
address occurs in parallel with cache access and tag compare,
not serially.


• The cache can grow in future imple¬ 
mentations to greater than 4 Kbytes 
per way. Four Kbytes/way is the limit 
of most physical caches, because 
only the 12 least significant bits of 
address remain unchanged in the 
virtual-to-real translation with 4- 
Kbyte pages, which the i860, 386, 
486, and most RISCs use. 5 Physically 
addressed caches usually access the 
data and tag using those 12 bits, 
while the upper 20 bits of the ad¬ 
dress are undergoing translation in 
the translation look-aside buffer. 
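
The 4-Kbyte-per-way limit in the second point follows from the page size, as this small Python sketch shows. The function names are hypothetical, and the check is a simplification that looks only at whether a way's index needs address bits above the page offset.

```python
PAGE_BYTES = 4096  # 4-Kbyte pages, as in the text


def untranslated_bits(page_bytes=PAGE_BYTES):
    """Number of low address bits that translation leaves unchanged."""
    return page_bytes.bit_length() - 1


def index_needs_translated_bits(way_bytes, page_bytes=PAGE_BYTES):
    """True when indexing one way requires bits above the page offset."""
    return way_bytes > page_bytes


offset_bits = untranslated_bits()                 # 12 for 4-Kbyte pages
four_kbyte_way = index_needs_translated_bits(4096)
eight_kbyte_way = index_needs_translated_bits(8192)
```

A 4-Kbyte way can be indexed entirely with the 12 untranslated bits; an 8-Kbyte way cannot, which is where a physically addressed cache would have to wait for the TLB.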


Virtually addressed caches contain ad¬ 
dressing information that is obsolete 
whenever the operating system changes 
the translation tables. Thus the operating 
system must flush the caches during a 
context switch or when a memory page 
is swapped out to disk. This flush takes
30 µs or less at 40 MHz (50 µs is the worst-case
value, when the entire cache is dirty, or
modified), which decreases
overall performance by 1 percent when
the CPU frequently switches contexts (250 per second).
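
The overhead estimate is the flush time multiplied by the context-switch rate. A minimal Python sketch of that arithmetic, with a hypothetical function name:

```python
def flush_overhead_percent(flush_us, switches_per_second):
    """Share of each second spent flushing the caches, in percent."""
    # microseconds per second spent flushing, divided by 10,000 to get percent
    return flush_us * switches_per_second / 10_000


typical = flush_overhead_percent(30, 250)
worst_case = flush_overhead_percent(50, 250)
```

At 250 context switches per second this gives 0.75 percent for a 30-µs flush and 1.25 percent for the 50-µs worst case, consistent with the rounded figure of 1 percent.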

Virtual tags also complicate aliasing of data that are writ¬ 
able in an associative cache. Aliasing in this sense means two 
different virtual address values refer to the same physical 
address, so that two copies of the same datum might appear 
in both ways of the two-way associative cache. For the case 
of a write to an aliased location, only one copy in the cache 
would be updated. 

Some operating systems use aliasing to share data or code 
between cooperating processes. Note, however, that read¬ 
only data and instructions can be aliased; if software requires 
writable data aliasing, it can make the cache behave as if it 
were a 4-Kbyte, direct-mapped cache by setting a bit in a 
control register and sacrificing some performance. 

Performance optimization of caches. Several design 
techniques increase performance of the data cache. 6 First, 
the cache behaves in a write-back or copy-back fashion, in
which stores to cached locations do not show up on the
external bus. Modified data moves off the chip only when
the cache line containing it must be replaced by a more recently
requested line. If the alternative (a write-through cache)
were used, the external bus would degrade performance in
operations on large data arrays. Such operations include graphics
transforms, matrix multiplies, fast Fourier transforms, and
other digital signal processing functions.

Figure 10. Load pipeline: Instruction after load does not use loaded
data (a); instruction after load with loaded data (b); and load
instruction with data-cache miss (c). W* writes occur in a later clock
cycle when the register file is free. (Legend: ADS, external bus
address strobe; C, cache access; RDY, external bus ready signal.)

In multiprocessors, write-back caches have a consistency problem when
one CPU modifies data in its cache, unbeknownst to other CPUs. A
write-through approach always updates main memory and other processors’
caches. The write-back approach leaves cache consistency management to
software, or to “bus snooping” hardware. The i860 XR CPU does not
contain snooping hardware. Thus, i860 XR multiprocessor cache
consistency requires software to avoid caching shared data, but the
write-back approach improves bus traffic and performance for
uniprocessors. The second-generation i860 XP CPU executes hardware
snooping.

When write-backs do occur, they may require only two bus cycles. Since
each cache line contains 32 bytes, writing an entire line would take
four transfers on the 8-byte bus. However, the processor writes only
half a line when only half has been modified. It does so by keeping two
dirty-status bits per line.
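The transfer counts follow directly from the sizes given above. As a minimal sketch (Python, with a hypothetical function name and with each dirty bit assumed to cover one 16-byte half of the line):

```python
LINE_BYTES = 32                 # cache line size from the text
BUS_BYTES = 8                   # external bus width from the text
HALF_BYTES = LINE_BYTES // 2    # assumed coverage of each dirty-status bit


def writeback_transfers(dirty_low: bool, dirty_high: bool) -> int:
    """Bus transfers needed to write back only the modified halves of a line."""
    dirty_bytes = HALF_BYTES * (int(dirty_low) + int(dirty_high))
    return dirty_bytes // BUS_BYTES


if __name__ == "__main__":
    print(writeback_transfers(True, False))   # one half modified
    print(writeback_transfers(True, True))    # whole line modified
```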

The on-chip cache makes load data available in the second instruction
after the load, with only one delay slot. When the data is used in the
very next instruction after the load, hardware makes the instruction
wait for the data, as shown in Figure 10b (unlike the Mips R3000, which
depends on the compiler or programmer to avoid using the data in that
next instruction). 7 Figure 10 shows the load pipelining, which can be
abbreviated FDACW (fetch, decode, address, cache access, write).
Floating-point loads, while using the same FDACW pipeline, face two
delay slots because the floating-point data is not “bypassed” from the
cache directly to the execution units. The large number of
floating-point data paths contributed to this decision.

The difference in length between the load pipeline (3 stages) and
arithmetic execution (4 stages) can yield conflicts in use of the one
integer-register file write port. For example, in Figure 10a the LD and
ADDU instructions both need to write results in clock cycle 4. This
conflict gets resolved without performance loss by delaying the load
write until a later time when no other instruction needs the write
port. That time occurs in cycle 3 in the figure, a branch’s fourth
stage. The processor behaves as if the load write occurred at the
normal time by feeding the load-data input register to any later
instructions referencing it.

Accesses to the cache by the load instruction LD are fully
scoreboarded: an internal state machine saves the destination register
tag and checks that subsequent instructions do not use the same
register. Scoreboarding allows execution to continue until the
destination is referenced by a later instruction (Figure 10c). However,
floating-point load (FLD) cache misses are not scoreboarded, because
the large number of paths possible to incoming data in the
floating-point registers made a floating-point scoreboard too costly.
Thus a cache miss on FLD freezes the CPU until the first datum is
available.
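The benefit of scoreboarding can be seen in a toy timing model. The Python sketch below is only an illustration, not the i860's control logic: it assumes one instruction issues per cycle, a fixed miss latency, and hypothetical register names. With scoreboarding, a missing load records its destination register and later instructions stall only if they touch that register; without it, the miss freezes issue until the data arrives.

```python
def issue_cycles(program, miss_latency, scoreboarded):
    """Return the cycle in which the last instruction issues.

    program: list of (dest, sources, misses_cache) tuples, one per instruction.
    """
    cycle = 0
    pending = {}  # register -> cycle when its outstanding load data arrives
    for dest, sources, misses_cache in program:
        # An instruction touching a register with a load outstanding must wait.
        ready = max(pending.get(reg, 0) for reg in (dest, *sources))
        cycle = max(cycle, ready) + 1
        if misses_cache:
            arrival = cycle + miss_latency
            if scoreboarded:
                pending[dest] = arrival   # remember the tag, keep issuing
            else:
                cycle = arrival           # freeze until the datum is back
    return cycle


# Hypothetical four-instruction sequence: a missing load, two independent
# adds, then an add that consumes the loaded register r1.
PROGRAM = [
    ("r1", ("r9",), True),
    ("r2", ("r3", "r4"), False),
    ("r5", ("r6", "r7"), False),
    ("r8", ("r1", "r2"), False),
]

if __name__ == "__main__":
    print("scoreboarded:", issue_cycles(PROGRAM, 6, scoreboarded=True))
    print("frozen on miss:", issue_cycles(PROGRAM, 6, scoreboarded=False))
```

In this model the independent adds overlap with the miss when the load is scoreboarded, so the sequence finishes two cycles earlier.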

Stores are posted into either of two on-chip write buffers, allowing
execution to continue despite write cache misses. The two write
buffers, at 16 bytes each, can hold a modified line for a write-back
that results from a replacement. They allow the line fill of requested
data to occur without waiting for the write-back of the line it
displaces. Furthermore, all cache line fills wrap around, so that the
first instruction or datum the processor sees on a miss is the needed
one. The other three transfers of the line occur in the background as
the processor resumes execution. After they complete, an accompanying
write-back can occur in the background.

(a)
        ld.l    iterations, counter_reg
        fld.d   array_pointer++, freg        //get datum
loop:
        adds    -1,counter_reg,counter_reg   //decrement counter
        fadd.ss freg,sum,sum
        bnc.t   loop                         //Delayed branch
        fld.d   array_pointer++, freg        //Delay-slot instruction

(b)
        ld.l    iterations, counter_reg
        pfld.d  array_pointer++, f0          //prime the 3-stage pipeline
        pfld.d  array_pointer++, f0          //
        pfld.d  array_pointer++, f0          //
        pfld.d  array_pointer++, freg        //get first datum into freg
loop:
        adds    -1,counter_reg,counter_reg
        fadd.ss freg,sum,sum
        bnc.t   loop
        pfld.d  array_pointer++, freg

Figure 11. Using the PFLD instruction: Code to sum a floating-point
array (a) and similar code using PFLD (b).

During instruction cache line fills, the CPU can continue executing
instructions as they are fetched, rather than waiting until the end of
the line transfer. Because the 8-byte external bus transfers two
instructions every two clock cycles, the CPU can copy the input
register into the instruction register for execution.

To ensure speedy data cache flushes, designers implemented a FLUSH
instruction. It empties the cache more than twice as quickly as the
replacement operations described earlier. FLUSH causes modified data to
be copied to the external bus and overwrites the tag directory with an
innocuous value. While a normal cache-missing load instruction also
would cause a dirty data dump, it would waste four bus cycles, bringing
a new line into the cache. The FLUSH instruction must be executed once
for each of the 256 cache lines, taking 50 µs total at 40 MHz in the
worst case when the entire cache is dirty but only 25 µs when no lines
are dirty.
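These totals translate into an average cost per cache line. The Python sketch below does the conversion using only the clock rate, line count, and flush times quoted above; the function name is hypothetical, and the per-line figure includes whatever loop overhead the quoted totals include.

```python
CLOCK_MHZ = 40       # clock rate from the text
CACHE_LINES = 256    # one FLUSH per cache line


def cycles_per_flush(total_us: float) -> float:
    """Average clock cycles per FLUSH when the whole sweep takes total_us."""
    return total_us * CLOCK_MHZ / CACHE_LINES


if __name__ == "__main__":
    print(f"no dirty lines:  {cycles_per_flush(25):.1f} cycles per line")
    print(f"all lines dirty: {cycles_per_flush(50):.1f} cycles per line")
```

The difference between the two cases, about four cycles per line, is the added cost of copying a modified line to the bus in this simple average.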

Bypassing the cache. To allow the programmer to control which data
gets into the cache, the processor provides a novel instruction, PFLD
(pipelined floating-point load). The processor does not cache items
fetched using PFLD. It intentionally bypasses the data cache to avoid
thrashing, or displacement of still-needed cache data by newly fetched
data.

Programmers can also make data noncacheable via the page-table entries
or by external hardware decoding addresses to deactivate the cache
enable (KEN#) pin, but these do not offer the optimizations found in
PFLD.

The pipelined part of PFLD allows it to tolerate the increased latency
of external memory compared to on-chip cache hits. The PFLD instruction
returns data from the address generated by the third previous PFLD.
This behavior enables the i860 processor to maintain the full bus
bandwidth while accessing large data sets that reside in external
memory, even if the latency of the memory is six clock cycles or
greater. That is, the semantics of the instruction allow the program to
issue the data request to memory many clocks before the resulting data
is demanded. PFLD works similarly to the pipelined add PFADD and
multiply PFMUL instructions. The destination register receives data
from the end of a three-stage pipeline, rather than the result from the
operation initiated using the two source registers specified in the
instruction.
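The "third previous PFLD" semantics can be modeled in a few lines. The Python sketch below is a behavioral illustration only. The class and function names are hypothetical, memory is a plain list, and the first three results are modeled as `None` to stand for the values a program would discard while priming the pipeline. It sums an array by issuing three priming requests, then consuming one datum per request, as in the loop of Figure 11b.

```python
from collections import deque


class PfldPipe:
    """Toy model: each pfld returns the data requested by the third previous pfld."""

    DEPTH = 3

    def __init__(self, memory):
        self.memory = memory
        self.in_flight = deque()

    def pfld(self, address):
        self.in_flight.append(self.memory[address])
        if len(self.in_flight) > self.DEPTH:
            return self.in_flight.popleft()
        return None  # still priming; a program would discard this result


def pipelined_sum(data):
    """Sum a list using only pfld-style loads."""
    if not data:
        return 0.0
    pipe = PfldPipe(data)
    last = len(data) - 1
    total = 0.0
    for i in range(len(data) + PfldPipe.DEPTH):
        # Requests past the end of the array only drain the pipeline.
        value = pipe.pfld(min(i, last))
        if value is not None:
            total += value
    return total


if __name__ == "__main__":
    print(pipelined_sum([1.0, 2.0, 3.0, 4.0, 5.0]))
```

The three extra requests at the end mirror the three priming requests at the start; the model re-reads the last element for them, which is one of several ways a real loop might handle the tail.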

PFLD uses an on-chip, three-stage data FIFO (first-in, first-out
buffer) situated between the data pins and the internal data bus. The
i860 uses the FIFO only for fast external memory systems, which return
data for PFLD requests before the subsequent PFLD instructions (which
use the data) are encountered. Here “fast” is a relative term, as the
sparse occurrence of PFLD in most programs makes any memory fast when
compared to the long interval between PFLDs. Data flows directly from
the pins to the destination register when the FIFO is empty, avoiding
any extra delay. Thus the three stages of the PFLD pipe can occur
either inside or outside the chip (as in Figure 7) or a combination of
both.

Using a normal load for vector operations can result in cache thrashing
when an entire cache line is fetched to reference one word of data one
time only. In this case, the data cache becomes a liability demanding
bus bandwidth unnecessarily. PFLD excels at processing large amounts of
data from a slow memory system and allows mixing of fetches from the
internal cache with external fetches. Furthermore, the chip maintains
coherence for PFLD by checking the cache for the data requested, just
in case the program has already loaded the data into cache using normal
LD or FLD instructions.
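The wasted bandwidth can be estimated with a simple count. The Python sketch below is an idealized model: it assumes 8-byte data words, 32-byte lines, an array far larger than the cache so that no line is reused after eviction, and a hypothetical function name. It compares bus traffic for cached loads against cache-bypassing loads at different strides.

```python
LINE_BYTES = 32   # cache line size
WORD_BYTES = 8    # one double-precision datum


def bus_bytes(n_words: int, stride_bytes: int, bypass_cache: bool) -> int:
    """Bytes moved over the bus to read n_words at a fixed stride, each used once."""
    if bypass_cache:
        return n_words * WORD_BYTES          # only the requested words
    # Cached loads fill a whole line for every distinct line touched.
    lines = {(i * stride_bytes) // LINE_BYTES for i in range(n_words)}
    return len(lines) * LINE_BYTES


if __name__ == "__main__":
    for stride in (8, 32):
        cached = bus_bytes(1000, stride, bypass_cache=False)
        bypassed = bus_bytes(1000, stride, bypass_cache=True)
        print(f"stride {stride:2d}: cached {cached} bytes, bypassed {bypassed} bytes")
```

With unit stride the two approaches move the same number of bytes; once the stride reaches the line size, each cached load drags in four times the data actually used.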

PFLD turns out to be easy to use in programs. It is handy for loops
that access data arrays, and for memory-block moves common in graphics
and operating-system code. The data does not have to be floating point,
and programs can effectively use PFLD on integers and pixel data. In
fact, the “floating” in PFLD merely refers to the destination, the
floating-point register file. To convert normal loads to pipelined
loads, the programmer merely includes three dummy PFLD instructions
using the first three data addresses before the beginning of the loop
(Figure 11). The internal FIFO takes care of the rest. 8

Manufacturer    Model            Clock rate   Price
DG Aviion       MC88000          33 MHz
Sun             Sparcstation 2   40 MHz       $18,000
IBM RS/6000     Model 320        20 MHz       $18,000
MIPS            RC3360           33 MHz       $62,000
Intel           i860 station     40 MHz       $21,000
IBM RS/6000     Model 540        30 MHz       $100,000
HP 9000         Model 720        50 MHz       $34,500

Figure 12. Specmarks and approximate prices for workstations (March 1991).

Performance measurements 

Good performance from a computer system depends on an efficient
instruction set; its compilers, software libraries, and operating
system; and a fast hardware implementation. However, evaluations should
include dollar costs to have real meaning, as a Cray supercomputer
outruns any workstation but is not cost-effective for many
applications.

Of course, benchmarks may not match the real-world experience of the
computer builder or buyer. But, given that every buyer cannot test all
machines on a particular program suite, standardized benchmarks have
become necessary input to buying decisions. The well-regarded SPEC
benchmark from the System Performance Evaluation Cooperative 9 uses 10
programs to measure CPU speed. With four C and six Fortran programs, it
covers a wide range of scenarios. The Fortran programs use
floating-point, making them appropriate for engineering users. SPEC
does not test I/O performance extensively, and compiler quality can
make or break SPEC performance. Figure 12 shows Specmarks for an i860
processor system under Unix Version 4.0 and other similarly priced RISC
workstations, each at the maximum clock rate currently offered.
Specmarks measure the speedup over a VAX 11/780, so higher Specmarks
are faster.
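A Specmark of this kind is commonly computed as the geometric mean of the per-program speedups over the reference machine. The Python sketch below shows that calculation; the function name and the run times in the example are invented for illustration and are not measurements of any machine named here.

```python
import math


def specmark(reference_times, measured_times):
    """Geometric mean of per-program speedups over the reference machine."""
    ratios = [ref / t for ref, t in zip(reference_times, measured_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # Hypothetical run times in seconds for a two-program suite.
    reference = [100.0, 100.0]   # reference machine
    measured = [10.0, 2.5]       # machine under test: 10x and 40x faster
    print(round(specmark(reference, measured), 2))
```

The geometric mean keeps one unusually fast program from dominating the result: speedups of 10 and 40 combine to 20, not to their arithmetic mean of 25.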

Most RISC processors at a given clock frequency (33-50 
MHz are maximum rates at the time of this writing) score 
similar SPEC results, because their instruction sets, caches, 
memory systems, and compilers are similar. However, Mips, 
Hewlett Packard, and IBM RS/6000 machines excel because 
of excellent compilers and RS/6000 superscalar execution, 10 
and the i860 excels because of fast floating-point operations. 

Smaller benchmarks like the Whetstone and Dhrystone fit into the
on-chip cache and exploit the 16-bytes-per-clock bandwidth of the cache
for text-move operations. Likewise, the fast Fourier transform, which
is ubiquitous in digital signal processing, proves extremely fast on
the i860 CPU, making it popular in signal processing and graphics
applications. 11 For the FFT, the i860 achieves nearly two instructions
each clock cycle and 52 Mflops at 40 MHz (Table 1). 12

Table 1. FFT performance for the fastest chips available in 1990. 12
FFT benchmark is 1,024-point, single-precision, complex-input.

CPU               Clock rate (MHz)   FFT time (ms)
Intel i860        40                 0.74
TI TMS320C30      33                 3.4
AT&T WE DSP32C    50                 2.8
Motorola 96002    33                 1.13

A 1,024-point fast Fourier transform equals approximately 16,000
multiply and 29,000 add/subtract operations.

Next-generation i860 CPU

The i860 XP microprocessor uses more than 2.5 million transistors in a
submicrometer process to enhance performance with new features and
32 Kbytes of on-chip caches. It is binary-compatible with i860
software. With additional registers, larger caches, a burst bus, and a
50-MHz clock rate, the CPU doubles the performance of the original
i860 XR processor. To allow system performance to grow faster than
clock rates, it also implements multiprocessor cache consistency and
synchronization primitives. Intel introduced the i860 XP and
accompanying second-level cache chips in June 1991.


AS A NEWER ENTRY IN THE MICROPROCESSOR RACE, the i860 architecture
incorporates the state of the art in performance-seeking ideas.
Novelties in the i860 CPU include simultaneous floating-point
operations similar to digital signal processing, a
two-instruction-per-clock mode, fast floating-point pipelines, graphics
instructions, and high-bandwidth registers and caches on-chip. These
features make it one of the fastest single-chip processors available,
popular especially with graphics and supercomputer builders.

By including these features, the i860 architecture will exploit future
advances in compiler technology; an architecture that does not offer
explicit parallelism opportunity can expect performance growth only
through faster clock rates and caches.

which also involves a considerable amount of money. Experts predict
that it will take at least another 10 years for Japan to catch up with
American MPU technology.

Techno Japan is published monthly by 
Fuji Technology Press Ltd. in Tokyo. 

MITI programs 

MITI’s Intelligent Production System (the IPS 10-year project,
US$1.1 billion) plans participation from DEC, Rockwell, United
Technologies, Boeing, General Motors, IBM, Kodak, and Texas
Instruments. Feasibility studies started in February.

MITI plans to expand its multilanguage machine translation system
development project to include European languages. This is a six-year
project ($150 million) begun in 1987 with participation of the Chinese,
Thai, Malaysian, and Indonesian governments. The project uses an
intermediate language and scheduled its first seminar with US and
European Community governments last July. MITI plans call for an
electronic dictionary and a prototype by 1995 and translation centers
established in Asian countries in 1993-1995.

MITI’s Basic Technology Center will conduct research on network-based
learning and develop a system for providing interactive learning
environments to network users. Some new companies will be funded to
accomplish this (about US$30 million).

Software structuring program

MITI has upgraded the New Software Structure Model to the status of an
international research project and set up two special research groups
in the Japanese Information Processing Association, which began
activities in October 1990. The purpose of this project is to develop a
system in which software will be flexible enough to deal with errors
and other changing external conditions. The project is an
industry-university cooperative program. Overseas participants include
Stanley Peters from Stanford University and Joseph Goguen from Oxford
University.

Production systems 

The country’s industrial structure is changing into multiple levels. In
particular, experts expect a shift into a tertiary sector centered
around service to continue. Economic and social conditions are moving
from quantitative expansion to qualitative improvements. Market needs
will force manufacturing industries to change from large quantities of
a few products to variable quantities of different products. Manpower
and labor shortages have already appeared. We’ve seen a noticeable
trend among science/engineering people to stay away from manufacturing.
The country has an urgent need to streamline facilities using
mechatronics (computers), and to shift from streamlining individual
facilities to entire manufacturing systems. The focus of business must
be shifted from “things” to “information.”

Specifications and quantities of products will become individualized.
So, production systems should be able to provide products based on
unique ideas. These systems must freely multiply or shrink and even
change functions. In addition, quick delivery after order receipt is
needed, as are distributed databases that are immediately updatable.
Production systems with high fault-tolerance and automated functions
(monitoring, malfunction diagnosis, restoration) are needed to carry
out unmanned operations. Production systems must be flexible to
minimize remodeling expenses and time, as well as the chances of the
systems becoming obsolete as a result of product/model/schedule
changes. Intelligent human interface support systems are needed.

The most important tasks include product design systems (CAE/CAD/CAM)
and production process design systems (computer assembly to manage
CAD/CAM). The country has a high demand for advanced supervisory
systems, though it rates automated restoration systems lower. Immediate
goals are in mechanical systems for diagnostic support. Automation of
visual observations and other sensing tests, rated higher by
manufacturers than by users, is expected by 1995-1998. In addition,
design operations need to be standardized. Plans call for development
of a comprehensive CAD/CAM/CAE system by 2000-2005.

Production engineering know-how must be transformed into databases to
bring about widespread use by 2000. Production management systems,
though developed early, will be affected by the development of
databases and expert systems. An automated assembly system should be
capable of copying an engine by 1995-1998.

Intelligent robots face problems in building other robots and
interfacing with peripherals. Developments in artificial intelligence,
such as 3D recognition, and neural/fuzzy systems are expected by
1995-2000.


On an open architecture 

To the Editor: 

In the October 1991 issue I particularly liked 
“A RISC Processor for Embedded Applications 
Within an ASIC,” “Performance and the i860,” 
Micro News, Micro Law, and Software Report. 

One criticism: In “On the Edge” about IEEE Std P1754, the P1754
architecture is compared to several commercially available
microprocessors in Table 1. This table is preposterously misleading. It
is clearly biased toward the P1754 architecture. For example, it claims
that there are “many” performance grades of P1754 but “few” of the
68000 architecture, and states that those are “mainly different clock
speeds.”

The author describes different performance 
grades as differing in cache implementation, 
external bus structure, basic technology, etc. 
Without even looking up references, I can cite 
the following variants on the 68000 architecture 
which differ in these respects: 

68008—8-bit data bus, 20-bit address bus; 

68000—16-bit data bus, 24-bit address bus; 

68HC000—CMOS; 

68020—256-byte instruction cache, extra address modes, 32-bit address
and data buses, dynamic data bus sizing;

68030—separate 256-byte instruction and data 
cache, MMU; 

68040—4K separate instruction and data cache, 
MMU, FPU, bus snoops; 

68331—added telecom features, also DMA and 
other features; 

68340—essentially 68332 but with simple 
counter/timers; and 

68EC040—68040 without FPU, MMU. 

Table 1 also claims that the 68000 architecture has only one
implementer. I believe the Philips/Signetics SCC68070 “highly
integrated microprocessor” is a contradicting example.

I hope that Micro will be more successful in 
the future at ensuring that its content is accurate 
and not overtly biased. 

Joseph M. Schachner 
Spring Valley, NY 

Reply: 

Thank you for your comments regarding the P1754 article I wrote. I
agree that my table may have been a bit misleading.

The point I was trying to make was not that other architectures don’t
have variations but that they have very limited variations based on
available devices. The implementation of P1754 is open. P1754 may be
implemented as a low-performance, low-cost device or as a
high-performance superscalar.

Even so, I believe the 68K family itself is large, and each
architecture variation (68000, 020, 030, 040, ...) is unique. P1754 is
based on the Sparc family of RISC processors. This family also has
different variations: Version 7, P1754, Version 8. Picking one
variation of the Sparc family that has been made a standard and
comparing it to one member of the 68K family would be a more
appropriate comparison.

You see, the concept of P1754 is very different from the 68K family
(for that matter, from most microprocessor architectures). P1754
defines an Instruction Set Architecture (ISA) and not a device. And as
such, the ISA is mostly flexible in performance grades and
implementations (i.e., cache, external bus structure, basic
technology).

Again, it was not my intention to insinuate that other architectures
(specifically 68K, 88K, i860) are not as good or less powerful but to
point out how different the definition of P1754 is.

Rudolf Usselmann 
In the mailbag

(LK: liked, DLK: disliked, LTS: like to see)

Many readers continue to write about the split articles. They get the
answer in this issue. From this experiment, we learned that readers do
care about Micro. Thanks, and please continue to write and to
care!—D.D.C.

December 1990

LTS: I suggest you send all the subscribers the index that appears in
the issue of December of each year on a floppy disk.—J.L.B.,
Montevideo, Uruguay (Your suggestion sounds reasonable to me. If it
proves cost-effective, I would propose it.—D.D.C.)

LK: Your articles on the ESPRIT project; what ... about the Japanese
project for the fifth-generation computer? LTS: Articles (cluster of
articles) on other projects like those under the ESPRIT umbrella (even
from Japan....)—B.M., Timisoara, Romania

February 1991 

LK: "The Drive to the Year 2000"— 
F.P.B, Laurel, MD 

DLK: The new practice of breaking 
up the articles into widely separated 
fragments is very poor.—S.L.P., Darien, 
CT 

April 1991 

LK: Analysis of multicomputer/multiprocessor systems; DLK: jumping
across articles; LTS: the architecture of MIPS R4000, FPU
MC68882.—K.V., Bangkok, Thailand

DLK: Your new magazine format is terrible!—K.R., North Amherst, MA

LK: The article on communication latency; it was well presented; LTS:
more articles analyzing the architecture/performances of various
machines.—A.S.M., Secunderabad, India

LK: “The end of the 386 monopoly”—the reportage style was clear and
interesting; LTS: in the Micro Law column, an article on patent law
changes with respect to micro applications.—D.S., Philip, Australia

LK: Micro News, Micro Review, Micro View, Micro Law, New Products. I
wait for “Hot Chips.” DLK: Hated: split articles! Yes, I have read the
editor’s reply on p. 2. Personally, I would prefer no color page at all
to a split (sliced and mixed) magazine. I’m sorry to say that.—T.P.,
Warsaw, Poland

June 1991 

LK: Micro News...—W.M.S., Atlanta, GA


NEW Conference Proceedings 

from IEEE Computer Society Press


12th real-time systems symposium 

The proceedings covers state-of-the-art research and key developments in 
real-time computing and discusses issues such as hard real-time 
communications, priority scheduling, object-oriented modeling, optimization, 
real-time databases, requirements specification models, and adaptive real-time
systems. Its articles discuss new research on improvements in the
construction of real-time systems and to enhance the growth in real-time 
computing R&D. 

320 PAGES. DECEMBER 1991. SOFTBOUND. ISBN 0-8186-2450-7. 

CATALOG NO. 2450 $70.00 MEMBERS $35.00 


VLSI DESIGN 1992 - 

5TH INTERNATIONAL CONFERENCE ON VLSI DESIGN 

The proceedings of VLSI Design ‘92 provides an extensive discussion of the 
latest advances in technology and explores recent technical opportunities in 
electronic design automation, and consists of over 80 papers on advanced 
test methods, physical design, design and fabrication, logic and high-level 
synthesis, VLSI tools and technology, testing, VLSI architectures, control 
and data path synthesis, signal processing applications, design for testability, 
layout, verification, industrial applications, and design verification. 

400 PAGES. JANUARY 1992. SOFTBOUND. ISBN 0-8186-2465-5. 

CATALOG NO. 2465 $90.00 MEMBERS $45.00 


8TH TRON PROJECT SYMPOSIUM

The TRON Project Symposium features four sessions providing detailed 
explanations of ITRON (industrial), BTRON (business), CTRON (central and 
communications), and TRON-specification VLSI CPU. It includes 19 articles 
covering implementation and application issues, and explores new specification 
proposals for next generation products based on the TRON architecture. 

In addition, it investigates the current efforts by researchers to develop a 
standardized and open architecture covering the area from operating system to 
VLSI chips. 

264 PAGES. NOVEMBER 1991. SOFTBOUND. ISBN 0-8186-2475-2. 
CATALOG NO. 2475 $60.00 MEMBERS $30.00 


1ST INTERNATIONAL CONFERENCE ON 
PARALLEL AND DISTRIBUTED 
INFORMATION SYSTEMS 

This book includes over 40 articles on current applications, and is divided into 
12 sections that investigate information systems technologies such as disk 
replication and arrays, object-oriented systems, joins, parallel logic, transitive 
closure and synopsis, retrieval, distributed systems, architectures/parallel 
processing, file systems, shared memory systems, and shared nothing systems. 

320 PAGES. DECEMBER 1991. SOFTBOUND. ISBN 0-8186-2295-4. 
CATALOG NO. 2295 $64.00 MEMBERS $32.00 
Send information for inclusion in Micro News
one month before cover date to Managing 
Editor, IEEE Micro, PO Box 3014, Los Alamitos, 
CA 90720-1264. 


Intel sues AMD for PLA copyright infringement

Richard H. Stern, Contributing Editor 

The first significant copyright infringement suit 
based on a hardware device, other than code 
stored in a ROM chip (see box on copyright 
cases), was filed in federal court in Austin, Texas, 
on October 9, 1991. Intel Corp. sued Advanced 
Micro Devices for infringing on a copyright in 
what Intel identified as a “computer program 
stored in PLA [programmable logic array] of 80386 
microprocessor” chip. 

(In another part of Intel’s complaint, it accused AMD of infringing on
an Intel copyright in the microcode of the same chip. Intel and AMD are
also engaged in patent, copyright, and mask work infringement
litigation in federal court in San Jose, California, over 80286 and
80287 chips. In addition, they have been engaged in an arbitration
proceeding, with related litigation, over their chip technology
interchange and second-sourcing agreements.)

Control program. Intel’s complaint against AMD states that the 386 PLA
contains the “control program” for the 386 DX and SX chips. Intel did
not describe the function of the so-called control program (the
company’s designation; it is not a recognized term of art), but it
appears that this PLA acts as a decoder for macro instructions. That
is, an assembly-code instruction, such as “Move contents of register A
to register B,” may be executed by five or more steps represented in
microcode. The microcode ROM of a microprocessor such as the 386 stores
the instructions for these steps, as well as the steps for other
instructions.

PLA and ROM copyright cases

In 1987 Alloy Computer Products brought two suits in federal court in
Los Angeles, charging copyright infringement of an alleged computer
program stored in a PLA device. 1,2 Apparently, these cases never came
to trial, perhaps because the defendants surrendered without contesting
the claim. No reported precedent on copyright in PLAs has emerged from
these suits.

Federal case law has established that operating system software stored
in ROM is subject to copyright protection. 3 This principle apparently
extends to microcode stored in ROMs in microprocessors. However, the
one decision on that issue concluded that the particular microcode
involved (8086/8088) was not infringed because the accused microcode
(V20/V30) was too different to be found “substantially similar” (the
test for copyright infringement). 4

References

1. Alloy Computer Prods. v. Ultratek Corp., No. 87-6993 (C.D. Cal.
filed Oct. 20, 1987).

2. Alloy Computer Prods. v. Asadi, No. 87-1285 (C.D. Cal. filed
Mar. 23, 1987).

3. Apple Computer, Inc. v. Franklin Computer Corp., 714 F.2d 1240
(3d Cir. 1983), cert. dismissed by stip., 464 U.S. 1033 (1984).

4. NEC Corp. v. Intel Corp., 10 USPQ 2d 1177 (N.D. Cal. 1989).

What Intel describes in its complaint as the control program may
therefore be a symbolic notation for a decoder PLA that receives as
input the object code for an expression such as, “Move contents of
register A to register B.” The decoder PLA then provides as output a
starting address in the microcode ROM for the relevant five or more
steps. Such a decoder would be, essentially, an array of gates (such as
And and Or gates) corresponding to a set of Boolean equations for
transforming instructions of the sort described (placed into object
code format) into appropriate addresses of the microcode ROM.
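To make the idea of a decoder built from Boolean equations concrete, the Python sketch below models a PLA as an And plane of product terms feeding an Or plane of output bits. Everything in it is hypothetical: the opcode bit patterns, the output address bits, and the names are invented for illustration and say nothing about the actual 386 decoder.

```python
# Each product term is (mask, value, output_bits). The term "fires" when the
# masked opcode bits equal value (And plane); the outputs of all firing terms
# are combined bit by bit (Or plane) to form the microcode start address.
PRODUCT_TERMS = [
    (0b11110000, 0b10000000, 0b000100),  # hypothetical "move register" group
    (0b11110000, 0b00010000, 0b010000),  # hypothetical "add" group
    (0b00001000, 0b00001000, 0b000001),  # hypothetical operand-size bit
]


def pla_decode(opcode: int) -> int:
    """Map an opcode to a microcode ROM start address via sum-of-products logic."""
    address = 0
    for mask, value, outputs in PRODUCT_TERMS:
        if (opcode & mask) == value:
            address |= outputs
    return address


if __name__ == "__main__":
    for opcode in (0b10001000, 0b00010000):
        print(f"opcode {opcode:08b} -> microcode address {pla_decode(opcode):06b}")
```

The table of product terms is the kind of symbolic description discussed here: it states the relationship between inputs and outputs without fixing how the gates are laid out on silicon.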

PLA contents are usually specified symbolically. This symbolic
representation is presented to the device that creates the hardware
representation. Since the symbolic representation describes the
relationship between inputs and outputs of the PLA, essentially Intel
is asserting a copyright in what the PLA does, rather than the specific
way it does it. Query: Can this be meaningfully distinguished from a
copyright in a source code, which is copyrightable?

There are no close legal precedents on whether information in this form
is a “computer program,” as that term is defined in the Copyright Act.
(Section 101 of the Copyright Act defines a computer program as a set
of statements or instructions used to bring about a result in a
computer.)

There are also no close legal precedents on whether making and selling
a physical device—for example, a 386 clone—containing a substantially
similar array of gates infringes on a copyright in the information
describing the array. Such copyrights are registered with the Copyright
Office by filing the information in some sort of printout (perhaps the
Boolean equations, or an object code printout in binary or Hex of the
microcode addresses tabulated against the object code inputs). Query:
Is the physical device a legally protected “copy” of what was
registered?


Semiconductor Chip Protection Act

The US Copyright Office’s refusal to issue copyright registrations for
semiconductor chip layouts led to passage of the Semiconductor Chip
Protection Act of 1984. This law covers layouts of unprogrammed PROMs
(programmable ROMs) but apparently does not cover layouts, or mask
works, of programmed PROMs blown from such unprogrammed PROMs. It also
covers layouts of both personalized and unpersonalized gate arrays.
Probably, it would cover the layout of the PLA involved in the Intel v.
AMD case. But a functionally similar PLA with a different layout from
the registered layout would not be covered by such a registration.


Did the Copyright Office know? 

Intel attached the copyright registration 
certificate (No. TX 3 121 803) for its 
PLA to its complaint. Such certificates 
include a “correspondence” box that 
the Copyright Office checkmarks if 
correspondence occurred between the 
office and the registrant. The box on 
this certificate is not checked. 

That omission may suggest that the 
Copyright Office rubber-stamped the 
application without realizing that a PLA 
was involved (assuming that the Copy¬ 
right Office even knows what a PLA 
is), and without realizing that the com¬ 
puter program in question was perhaps 
imaginatively so designated. One might 
expect the Copyright Office, if it knew 
that a gate array was being claimed as 
an information storage device, would 
ask for an explanation. 

Further, considering the Copyright
Office’s history of resistance to copyrights
on hardware (see box on Semiconductor
Chip Protection Act), one
might expect it to issue gratuitous
advice to Intel that the copyright did
not extend to physical devices built in
accordance with or embodying the
information. It is common practice for
the Copyright Office to do so, and to
place a record of the advice, in the form
of a letter, in the official file.

None of the foregoing suggests that 
an explanation about the PLA, if Intel 
had given one to the Copyright Office 
in response to such a demand, would 
necessarily have been unsatisfactory. 
There is a great deal of isomorphism 
between 1s and 0s in a ROM and the
equivalent in a PLA. One device Ands 
sets of Or gates, while the other 
device Ors sets of And gates. It may 
well be that analogous assembly codes 
can be created and described for each 
device. 

Both such codes (if they are accepted 
as being codes) might be compiled or 
assembled into 1s and 0s (connections
and open circuits) in analogous ways. 
But one would have expected the 
Copyright Office to ask for such an 
explanation if it understood what was 
going on in the case of this copyright 
application. Perhaps AMD’s eventual 
response to Intel’s claims will shed 
more light on these issues. 

If PLA programs are held protectable 
by copyright, that will have startling 
and important implications for design¬ 
ers of electronic circuitry. (For a fur¬ 
ther discussion of the issue, see Micro 
Law, “Field-programmable logic 
devices—Are they hardware or soft¬ 
ware? Can their programmed configu¬ 
rations be protected against copying?” 
in IEEE Micro, Vol. 6, No. 5, Oct. 1986, 
pp. 61-62 and 78.) 
As could be expected, demands of the
user community and developments in 
computer hardware and software have 
affected database technology. Al¬ 
though, traditionally, database practitioners 
worked with well-structured or semistructured da¬ 
tabases, the development of new and reliable 
sensing techniques added impetus to research of 
unstructured databases such as those containing 
images. 

These trends focused interest on unconven¬ 
tional database applications, despite their very 
large data volumes. For example, a 150-bed hos¬ 
pital generates 2 Gbytes of image data each day, 
and the US National Aeronautics and Space 
Administration’s planned Earth Observing System 
program could generate several Gbytes of data 
every day. The archiving, retrieval, and manage¬ 
ment of such large databases constitute a com¬ 
plex and difficult task. Recent trends in database 
technology suggest that database computers can 
best undertake the efficient management of large 
databases. That is, specialized hardware and soft¬ 
ware configurations aimed primarily at handling 
large databases and answering complex queries 
become the best solution. 

In general, most of the data modeling effort fo¬ 
cuses on three widely used, structured database 
models: relational, hierarchical, and network. Other,
recently proposed object-oriented data models 
may be suitable for modeling databases that in¬ 
clude structured and unstructured data. Since the 
theoretical underpinnings of the relational model 
have been well defined, the relational model be¬ 
came the focus of most of the commercial and 
research efforts in database machines. Some addi¬ 
tional effort in the development of hardware solu¬ 
tions for semistructured databases like text 
retrieval also surfaced. The articles in this issue 
reflect these database machine trends. Four of
them discuss issues related to relational databases,
and another delves into text retrieval.

The primary motivation of the database ma¬ 
chine approach is to increase the performance of 
updating operations and the query processing of 
databases. One approach taken to achieve this 
objective employs parallel architectures. The de¬ 
signer must choose the appropriate approach for 
data distribution and storage, interprocessor com¬ 
munications, and computational capability of each 
node. We can divide the parallel architectures 
studied in the context of database processing into 
three groups: shared memory, shared disk, and 
“shared nothing.” 

The current trend in the design of database 
computers is toward the increasingly popular 
shared-nothing 1 message-passing architectures. 
Designers usually prefer these architectures over 
shared-memory and shared-disk architectures be¬ 
cause they permit a high degree of scalability. The 
MDBS database computer described by Hsiao in 
this issue belongs to this category. 
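
As a rough illustration of why shared-nothing designs scale, the sketch below partitions rows across nodes so that each node scans only its own private data and a coordinator merely combines small partial results. Hash partitioning on a key and the function names are assumptions made for this example; the source does not say how MDBS or any other system distributes its data.

```python
from collections import defaultdict


def partition(rows, key, n_nodes):
    """Hash-partition rows so each node owns a disjoint slice of the data."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(row[key]) % n_nodes].append(row)
    return nodes


def node_count(local_rows, predicate):
    """Work done by one node: scan only its private store."""
    return sum(1 for row in local_rows if predicate(row))


def distributed_count(nodes, predicate):
    """Coordinator: combine the per-node partial counts (the only messages)."""
    return sum(node_count(rows, predicate) for rows in nodes.values())
```

Adding a node in this model adds both storage and scanning capacity, and the only traffic between nodes is one small partial result per query.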

Database engines 2,3 represent another example
of shared-nothing database systems. The high- 
performance and fault-tolerant database engines 
generally support a server/client architecture. A 
number of new database engine products have 
already hit the market, and more are currently 
under development. Teradata’s DBC/1012 uses 
80386 processors (up to 1,024). Because proces¬ 
sors can be added as needed to increase both CPU 
and input/output capabilities, this database engine 
can handle very large databases. 4 White Cross offers
a database engine based on the Inmos transputer.
The hardware and software structure in this
database engine permits a maximum of 100,000 
MIPS of processing power. 5 

The articles in this special issue on database 
computers fit into two main classes. The first class 
centers on hardware accelerators for database 
computers, and the second class presents two different
database computers. The proposed hardware
accelerators concentrate on the efficient implementation
of specific database tasks. The articles on database computers 
deal with the overall architecture, performance, and data dis¬ 
tribution issues; they also discuss specific hardware accelera¬ 
tors incorporated in the systems. 

The first three articles examine the design and evaluation of 
special-purpose accelerators for database computers. Lee et al. 
present two VLSI accelerators for formatted and unformatted 
databases. The database tasks performed by the accelerators 
include operations such as associative search, aggregation, and 
string search. 

Faudemay and Mhiri present an associative accelerator 
whose speedup ratio is independent of the database size. The 
accelerator implements relational database operations and sort¬ 
ing and aggregate functions. 

Our article describes the design and simulation of a bit- 
serial accelerator for statistical aggregation operations. A num¬ 
ber of bit-serial processing elements connected according to 
the odd-even network topology make up the accelerator. The 
proposed unit achieves a high degree of pipelining and 
parallelism. 

The remaining two articles describe two database comput¬ 
ers and emphasize both architectural and performance con¬ 
siderations. Hsiao presents an experimental database computer 
with a variable number of database processors known as the 
Multiback-end Database Supercomputer (MDBS). Each
microprocessor-based processor has a private database store,
consisting of a small Winchester drive for paging and metadata
and a large drive for the database. 

Finally, Inoue et al. describe a relational database processor 
with specialized hardware for searching and sorting known as 
Rinda. Rinda is composed of two hardware accelerators, the
CSP (Content Search Processor) and the ROP (Relational Operation
Accelerating Processor). The CSP searches rows of data
in disk storage and transfers the selected rows to main
memory. The ROP subsequently sorts the rows transferred to
main memory.

We thank the reviewers who helped us referee the submit¬ 
ted papers and the IEEE Micro editorial staff. This special issue 
would not have appeared without their assistance.
VLSI Accelerators for
Large Database Systems 


VLSI accelerators speed up time-consuming database operations while maintaining the cost 
and flexibility benefits of general-purpose computers. An experimental VLSI relational data 
filter performs high-speed associative searches, and a text filter searches for strings at up 
to 1 Gbyte/s. 


Kuo Chu Lee
Takako Matoba Hickey
Victor W. Mak
Gary E. Herman

Bellcore

Attempts to improve the performance of query processing
generally fall into one of two categories. Application-specific,
or dedicated, database machines 1-6 improve a specific set of
applications and so usually do not support other types of
database applications or general-purpose computing. These
machines are expensive to develop and maintain, so they are
only economically feasible where the cost can be justified.

General-purpose database systems 7-9 avoid the problem of
specialized hardware by developing database system software on
general-purpose computers. But they are less efficient than
dedicated machines with customized I/O and processing
units. 4,10,11

A VLSI accelerator approach can improve the performance of
database processing on general-purpose computers. We identified
the most time-consuming database operations on general-purpose
computers and designed application-specific accelerators to
speed up these operations. By using a rapid-turnaround design
methodology, 12 we can correctly and efficiently design VLSI
accelerators.

Associative search, aggregation, and string 
search are among the lengthy operations we iden¬ 
tified. Associative searches and aggregations are 
performed on formatted relational records consisting
of multiple attributes of specific types and
lengths. The associative search selects records 


based on a user-defined predicate that specifies 
the relationships between multiple pairs of 
attributes and search patterns. The aggregation 
operation sums or counts an arbitrary attribute of 
all the selected records. The string search finds all 
occurrences of a pattern in an unformatted data 
string. 

All of these operations require scanning a large 
portion of data from the storage devices. As a 
result, the evolution of mass storage technologies 
such as optical disks 13-15 and silicon memory storage 16
has strongly influenced the fundamental
design assumptions for VLSI accelerators. These 
advanced device technologies make possible data 
transfer rates of up to a few Gbytes/s. At such high 
I/O rates, direct transfer of data from storage to the 
CPU is likely to cause thrashing. The CPU cannot 
perform useful work while it handles frequent 
page faults caused by the increased I/O opera¬ 
tions. Therefore, a critical function of the VLSI 
search accelerators is to exclude as much irrel¬ 
evant data as possible before delivering data to the 
CPU. CPU cycles can be saved for other general- 
purpose tasks, such as index traversal, updates, 
transaction processing, and statistical analysis. 

We designed and had fabricated an experimen¬ 
tal VLSI accelerator research prototype for rela¬ 
tional data filtering. This relational data filter 
performs high-speed associative search and 
aggregation operations for formatted databases. 
For unformatted databases, we designed an 
experimental fast string search VLSI accelerator that provides
a few orders-of-magnitude improvement in text search speed. 
The VLSI accelerators have precise instruction sets and hardware
interfaces that facilitate their integration into general-purpose
computer systems and dedicated systems. 17

Formatted database accelerator 

The experimental relational data filter speeds up formatted
database operations by supporting associative search and aggregation
operations. Speeding up the associative search is
gregation operations. Speeding up the associative search is 
not a simple task since it requires either a full database scan or 
a full indexing scheme. A full scan of the database in a gen¬ 
eral-purpose computer architecture is very costly. Full index¬ 
ing requires large memory space—at least proportional to the 
number of data items stored—and incurs large overhead for 
update and database administration. The full indexing scheme 
works well for selection queries based on indexed attributes. 
It is ineffective, however, for complex queries that require 
scanning a large portion of the database, such as range queries,
selection queries with don’t care characters in the criteria,
and queries that take statistical measures on some attributes. 
Thus, improving the performance of associative searches is a 
fundamental need in formatted database systems.

We approached the problem of speeding up the full data¬ 
base search operations by using a VLSI relational data filter 
that scans directly at the high-speed storage output. In sup¬ 
port of associative search operations, the data filter selects 
interesting records from the storage out¬ 
put and passes only those to the main 
memory. This allows the general-purpose
processor to concentrate on more com¬ 
plex operations on the smaller set of 
records. The data filter also provides 
high-speed database aggregation opera¬ 
tions such as COUNT, SUM, and MAXI¬ 
MUM, based on arbitrary selection 
criteria. 

Architecture. The VLSI data filter can 
be viewed as a reduced instruction-set 
computing (RISC) processor, 18,19 with the
instruction set optimized for associative 
search and aggregation operations. A 
more complex comparator, such as an 
ALU, is needed to enable the filter to 
perform more complex aggregation 
operations. However, multiple complex 
comparators are too costly. The archi¬ 
tecture as illustrated in Figure 1 achieves 
parallelism by efficiently sharing the 
single comparator unit across multiple 
queries stored in the instruction buffer. 

The data filter accepts records continu¬ 
ously from the storage output, synchro¬ 


nizes the stream speed to the internal speed, and holds each 
record in an internal record buffer. Records are double-buff¬ 
ered: One record is resident in one buffer and is examined, 
while the next record is arriving in the other buffer. Instruc¬ 
tions are also double-buffered: Instructions in one buffer 
are executed against a series of records, while the next batch 
of instructions is loaded into the other buffer. Selected portions
of the resident record are evaluated in the execution unit
(an ALU and its associated logic) according to the instructions 
in the instruction buffer. If any of the instruction sequences are 
evaluated to be true, that is, if the record satisfies a selection 
query, the ID of the query passes off the chip to the main 
memory. If any queries are satisfied, the record passes to the 
main memory at the end of the execution time interval. 
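
The record flow just described can be modeled in a few lines of software. This is only a behavioral sketch under simplifying assumptions: queries are plain Python predicates rather than filter instructions, the two record buffers are list slots, and the names `double_buffered` and `data_filter` are hypothetical. In hardware the fill of one buffer and the examination of the other overlap in time; here they simply alternate.

```python
_END = object()  # sentinel marking the end of the input stream


def double_buffered(stream):
    """Yield each record while the following one sits in the other buffer."""
    it = iter(stream)
    buffers = [next(it, _END), _END]
    active = 0
    while buffers[active] is not _END:
        buffers[1 - active] = next(it, _END)  # next record arrives in idle buffer
        yield buffers[active]                 # resident record is examined
        active = 1 - active                   # swap buffer roles


def data_filter(records, queries):
    """Yield (satisfied query IDs, record) for records matching any query.

    `queries` maps a query ID to a predicate over one record.
    """
    for record in double_buffered(records):
        hit_ids = [qid for qid, predicate in queries.items() if predicate(record)]
        if hit_ids:  # pass the record on only if some query is satisfied
            yield hit_ids, record
```

Records that satisfy no query never reach the consumer, which is the point of placing the filter between storage and main memory.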

The on-chip instruction and record buffers improve the 
speed of instruction and operand fetch and thus improve the 
performance of the associative search. Since an associative 
search requires repetitive application of the same query against 
the search attributes of all the records in a relation, a reduc¬ 
tion in instruction fetch time significantly reduces the overall 
search time of a single query. Reducing the operand fetch
time allows more instructions to be applied against each 
record; thus, the apparent search parallelism of the data filter 
is also improved. 

Figure 1. Experimental data filter architecture.

Pipelining further improves the apparent execution rate
and search throughput by overlapping operations in different
stages of the data filter. For example, our prototype
executes each instruction in six pipeline stages including
instruction fetch, instruction decode, operand fetch, ALU
execution, register store in parallel with Boolean evaluation,
and output of query ID.
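
A short calculation shows what the overlap buys. The sketch assumes an ideal pipeline: every stage takes one cycle and there are no stalls. Both are simplifications made for illustration, not measured properties of the prototype.

```python
STAGES = [
    "instruction fetch",
    "instruction decode",
    "operand fetch",
    "ALU execution",
    "register store / Boolean evaluation",
    "output of query ID",
]


def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles to finish n instructions when the stages overlap."""
    if n_instructions == 0:
        return 0
    # The first instruction fills the pipe; each later one finishes a cycle apart.
    return n_stages + n_instructions - 1


def unpipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles when each instruction must finish before the next one starts."""
    return n_stages * n_instructions
```

For a batch of 32 instructions the ideal pipeline needs 37 cycles rather than 192, so the apparent rate approaches one instruction per cycle.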

The size of the record buffer, instruction buffer, and data 
path can be engineered to suit the application need and tech¬ 
nology used. We targeted our prototype for 64 Mbytes/s for 
the record input stream and 62.5 ns for the processor cycle 
time. With a 32-bit data path and a 128-byte record buffer, 
we determined that the instruction buffer holds 32 instruc¬ 
tions, the maximum number of instructions that can be 
executed in the record interarrival time of 2 µs. A longer
record size can be supported by using external buffers. The 
portion of data specified in the search predicates is stored
in the record buffer. We expect that the search attributes for 
a typical relational record will not exceed 128 bytes. 
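
The instruction budget follows directly from the figures quoted above, as the following check shows. It assumes 1 Mbyte/s means 10^6 bytes/s and that records arrive back to back; the function name is hypothetical.

```python
def instruction_budget(record_bytes, input_rate_bytes_per_s, cycle_time_s):
    """Return (record interarrival time in s, instructions that fit in it)."""
    interarrival_s = record_bytes / input_rate_bytes_per_s
    # round() guards against floating-point error in an exact ratio
    return interarrival_s, round(interarrival_s / cycle_time_s)


interarrival_s, budget = instruction_budget(
    record_bytes=128,
    input_rate_bytes_per_s=64e6,  # 64 Mbytes/s input stream
    cycle_time_s=62.5e-9,         # 62.5-ns processor cycle
)
```

A 128-byte record at 64 Mbytes/s arrives every 2 µs, and 2 µs divided by 62.5 ns gives the 32 instructions that the instruction buffer holds.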

There are four major blocks inside the data filter. 

• Record buffer control. The record buffer control syn¬ 
chronizes the arrival of incoming records, manages the 
double buffering of records, and synchronizes signals 
for the external control processor. The synchronizer 
detects only a few special words such as start of record, 
end of record, start of segment, and end of segment. When 
it finds any of these words, the synchronizer outputs an 
appropriate boundary indicator and adjusts the address 
register to place the incoming information in the appro¬ 
priate location within the record buffer. With minimal 
embedding of schema information in the data stream, 
the data filter hardware is free from the runtime decod¬ 
ing of schema information that can slow down filter 
processing. 

• Instruction buffer control. The instruction buffer con¬ 
trol manages the double buffering of instruction batches, 
sequencing, and decoding. We kept the logic of instruc¬ 
tion buffer control simple by writing a simple instruc¬ 
tion set, as discussed later. Without branch instructions, 
the sequencer becomes a simple counter with a reset 
logic. The small and orthogonal instruction set enables 
simple and fast decoder implementation using program¬ 
mable logic arrays (PLAs). 

• Execution unit. A parallel data path with an ALU and a 
register file make up the execution unit. Two operands 
are fetched into two operand receiving latches and then 
transferred to the ALU operand latches. By using these 
two close-coupled latches, the operands receive suffi¬ 
cient set-up time for the latches at the ALU. As a result, 
the maximum stage delay of the pipe is very close to the 
clock cycle time if a two-phase, symmetric clock is used. 
The result of the ALU operation moves from the destina¬ 
tion latch to the register file. The operand can be 
selected from the register file, from the record buffer, or 
from the pattern specified by the instruction. All the comparison
and data movement operations are carried
through the ALU. In the case of a data dependency, the
temporary register can be used to store the result of the
first ALU operation and to direct it to the ALU input latch
for the next instruction to save translation time and register
locking overhead.

SELECT record FROM relation WHERE boolean expression FOR ID

where boolean expression ::= term | term OPb boolean expression
      OPb                ::= ∧ | ∨
      term               ::= attribute OPa value
      OPa                ::= > | < | = | ≥ | ≤ | ≠

Figure 2. Boolean select operation.

• Boolean result analyzer. The most important feature 
of the experimental data filter is the fast Boolean predi¬ 
cate evaluation and the conditional assignment capabil¬ 
ity defined in the instruction set and supported by the 
internal architecture. Two single bits, A and Hit, imple¬ 
ment the continuous evaluation of conjunctive normal 
form and conditional assignment that we use to avoid 
conditional branch instructions, as we discuss later. A 
comparison result derived from the ALU flags and the 
test condition is first evaluated with a Boolean accumu¬ 
lator A under the opcode control. Then, the result of the 
evaluation is stored back into the A accumulator. If the 
result of the last Boolean operation pertaining to a query 
is true, a Hit flag is set. This Hit flag is used by the 
conditional assignment instructions. If it is set, the result 
of the assignment will be written into the register file; 
otherwise, the result is discarded. 

Instruction set. The instruction set of the data filter blends 
complex instruction-set computing (CISC) and RISC philoso¬ 
phies. Each instruction exhibits high semantic content and 
performs multiple actions, as advocated in CISC. However, 
the number of different instructions is minimized to simplify 
the instruction decoding, as advocated in RISC. The instruc¬ 
tion set has two major modes, one to improve the associative 
search operations and the other to enable aggregation opera¬ 
tions. 

Boolean mode. The primary objective of the data filter is to 
speed up the record selection operations. The data filter takes 
fixed-size relational records 20 as input data and relational 
queries as instructions to be executed against the input records. 
Records that satisfy the query predicates become outputs. 
The basic format of the select operation is given in Figure 2. 
As an example, the query, “What Indian restaurants are in 
New York?”, may be written in relational algebra. 
SELECT record FROM (relation = “restaurants”)
WHERE (type = “Indian”) ∧ (location = “New York”)
FOR (Query 1)

The design of the instruction set to realize this selection
predicate directly affects search performance. For example, if we
assume each string can be coded into one comparison unit,
say a long integer, we might implement this query in a
general-purpose processor as

addr0:  cmp   @relation, #Restaurants
        jneq  addr1
        cmp   @type, #Indian
        jneq  addr1
        cmp   @location, #New York
        jneq  addr1
        mov   hit, #1
addr1:

Three comparisons correspond to each of the comparisons
in the select clause, with each followed by a branch
instruction.

To improve execution efficiency, the data filter architecture
includes Boolean instructions. These instructions speed
up the selection operation through the use of the Boolean
accumulator A, which accumulates the Boolean result pertaining
to one query. Three basic operations can be performed
on the Boolean accumulator: PUSHA, which pushes
a Boolean value to A; ANDA, which Ands another Boolean
value to the value in A; and ORA, which Ors another value to
the value in A.

The prototype only supports a one-bit accumulator, but it
can be easily extended into a stack to support a wider range
of truth operations, such as (B ∧ C) ∨ (D ∧ E), where B, C, D,
and E are Boolean truth values.

In addition to the three basic operations, four operations
on A are defined to tie together the truth values pertaining to
one query. FANDA and FORA are like ANDA and ORA but
are used for final instructions pertaining to one query. These
two instructions turn on the Hit flag, which can be used as a
condition for arithmetic instructions discussed in the next
section. SANDA and SORA are also used for final instructions,
but these turn on the Ghit flag, which causes the output
of the current record.

The basic format of a Boolean mode instruction is given by

OPr(OPa(MEM, Pattern, Mask), ID).

where OPr is the operation on accumulator A just discussed.
This one instruction specifies three operations to be executed
in sequence: a comparison, a test on comparison result, and
an operation between the result and the accumulator A. The
data filter first compares the word at address MEM (mask)
with Pattern, and, if the comparison flags (LT, GT, EQ) satisfy
the flag pattern specified by OPa, the filter sets the Boolean
value to true. Finally, it performs the OPr operation on this
Boolean value.

A select query may be decomposed into multiple data filter
instructions, one instruction for each of the attribute
comparisons. For example, the selection query previously
presented may be decomposed as

PUSHA(EQ(relation, “restaurants”, f), 1).
ANDA(EQ(type, “Indian”, f), 1).
SANDA(EQ(location, “New York”, f), 1).

Thus the data filter executes the selection query in only three
machine cycles. We intentionally avoid using the exact format
here to better illustrate our idea. A similar format can be used
as an input to a front-end compiler that compiles the input
into the binary form understood by the actual filter. In the
example, we use strings such as “relation” to represent memory
addresses and use mask f to mean we want the entire
attribute.

Arithmetic mode. The second objective of the experimental
data filter prototype is to support aggregation operations
such as COUNT, SUM, and MAXIMUM. Adding this capability
to the filter offloads the main CPU from loading and scanning
large amounts of data. It also allows simultaneous
execution of aggregation operations that cannot be easily
achieved in conventional database systems.

Aggregate operations can be supported in the data filter by
adding a conditional assignment capability that allows accumulating
the result to a register depending on certain conditions.
A conditional assignment is conventionally implemented
by a comparison followed by a branch followed by an
assignment instruction. However, this approach can be
inefficient due to the pipeline breaks. Hence the data filter
defines an instruction mode that can perform the ASSIGNMENT
operation based on a condition code. This type
of instruction together with the Boolean instructions previously
discussed allow numeric operations to be carried out
without pipeline breaks.

The basic format of the arithmetic mode instruction is given
by

OPalu(Cond, M1, M2, M3).

The semantics of the instruction are: If the condition specified
by Cond is true, carry out the ALU operation OPalu on
two operands M1 and M2 and store the result into M3. The
condition can be true (unconditional), hit (set by Boolean
instructions), and six combinations of three ALU flags: less-than,
greater-than, and equal. M1 can be a memory address
or a register, M2 can be an immediate operand or a register,
and M3 is a register.
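
The Boolean mode format OPr(OPa(MEM, Pattern, Mask), ID) discussed earlier in this section can be illustrated with a small interpreter. This is a sketch of one reading of the text, not the chip's actual encoding. The assumptions are: strings are coded as integers from their first four ASCII bytes, the mask is applied to both the stored word and the pattern, the F-prefixed operations copy the final accumulator value into Hit, and the S-prefixed operations copy it into Ghit.

```python
import operator

FULL = 0xFFFFFFFF  # stands in for the article's mask "f" (entire attribute)


def code(text):
    """Hypothetical coding: pack the first four ASCII bytes into one integer."""
    return int.from_bytes(text.encode("ascii")[:4].ljust(4, b"\0"), "big")


OPA = {"EQ": operator.eq, "NE": operator.ne, "LT": operator.lt,
       "GT": operator.gt, "LE": operator.le, "GE": operator.ge}

# OPr name -> (how to combine with accumulator A, flag updated afterwards)
OPR = {
    "PUSHA": ("push", None), "ANDA": ("and", None), "ORA": ("or", None),
    "FANDA": ("and", "hit"), "FORA": ("or", "hit"),
    "SANDA": ("and", "ghit"), "SORA": ("or", "ghit"),
}


def run_boolean(program, record):
    """Run Boolean-mode instructions on one record; return (Hit, Ghit)."""
    a = hit = ghit = False
    for opr, opa, mem, pattern, mask, _query_id in program:
        truth = OPA[opa](record[mem] & mask, pattern & mask)  # compare, then test
        combine, flag = OPR[opr]
        if combine == "push":
            a = truth
        elif combine == "and":
            a = a and truth
        else:
            a = a or truth
        if flag == "hit":
            hit = a
        elif flag == "ghit":
            ghit = a
    return hit, ghit


# The decomposed restaurant query: one instruction per attribute comparison.
QUERY_1 = [
    ("PUSHA", "EQ", "relation", code("restaurants"), FULL, 1),
    ("ANDA", "EQ", "type", code("Indian"), FULL, 1),
    ("SANDA", "EQ", "location", code("New York"), FULL, 1),
]
```

The loop contains no data-dependent jump between instructions: each instruction always executes and only the accumulator changes, which is what lets the hardware keep its pipeline full.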
The main loop of the counting query, “How many people
live in San Francisco?”, can be compiled as 

PUSHA(EQ(relation, “person”, f), 2).

FANDA(EQ(town, “San Francisco”, f), 2). 

INCB(Hit, null, sumreg, sumreg). 

A sum query can be formed by replacing the increment 
operation with an add operation on the summing attribute, 
say salary, with the sum register. 

ADD(Hit, salary, sumreg, sumreg).

A maximum query can be formed by replacing the increment 
operation with a compare operation on a maximizing 
attribute, say age, with the maximum register followed by a 
conditional assignment of the maximizing attribute to the maxi¬ 
mum register. 

COMP(Hit, age, maxreg, null). 

EQUA(Gt, age, null, maxreg). 
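
The three aggregate loops can be mimicked in software to show how conditional assignment replaces branching. The sketch makes several assumptions: the Hit test is written as an ordinary Python expression, a separate hypothetical register `countreg` is used so that all three aggregates run in one pass, the Gt flag is treated as clear whenever the compare does not execute, and `maxreg` starts at zero (so values are assumed non-negative).

```python
def cond_assign(regs, cond, dest, value):
    """Write value into regs[dest] only when cond is true; else discard it."""
    regs[dest] = value if cond else regs[dest]


def aggregate(records, town):
    """Count, sum of salary, and maximum age over the people in `town`."""
    regs = {"countreg": 0, "sumreg": 0, "maxreg": 0}
    for rec in records:
        # Boolean mode: PUSHA(relation = person), FANDA(town = ...) sets Hit
        hit = rec["relation"] == "person" and rec["town"] == town
        # INCB(Hit, ...): increment the count only on a hit
        cond_assign(regs, hit, "countreg", regs["countreg"] + 1)
        # ADD(Hit, salary, sumreg, sumreg)
        cond_assign(regs, hit, "sumreg", rec["salary"] + regs["sumreg"])
        # COMP(Hit, age, maxreg, null) then EQUA(Gt, age, null, maxreg)
        gt = hit and rec["age"] > regs["maxreg"]
        cond_assign(regs, gt, "maxreg", rec["age"])
    return regs
```

Every record runs the same fixed instruction sequence; the condition only decides whether a result is kept.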

We can extend operations on accumulator A so that the 
accumulator can be set or cleared based on application- 
specific condition flags for linking computation results of 
added functional components. For example, we can add a 
string search unit, which we describe later, to the data filter 
as a functional component. The result of a string search op¬ 
eration, finding a pattern in the input stream, can be kept in 
a condition code and used to set or clear the accumulator. 

Discussion. The programmability of the filter lends it to 
general applicability in a variety of applications requiring high¬ 
speed filtering of structured data. Further, the filter architec¬ 
ture scales well with improved fabrication technology, 
supporting a faster evaluation rate and more complex query 
sets for finer fabrication geometries. 

The VLSI data filter performs better for synchronous search 
of structured data than both CISC and RISC general-purpose 
microprocessors. This advantage derives primarily from the 
architectural support provided both for pipelined Boolean 
predicate evaluation and for efficient I/O through double 
buffering of data. We avoided pipeline breaks by using Bool¬ 
ean evaluation and conditional assignment instructions. The 
parallel data path for instruction and immediate operand store 
prevents a pipeline stall due to double-operand fetch con¬ 
tention. These features permit more pipeline stages while 
avoiding pipeline breaks. The architectural optimizations are 
combined with the overlapped cache refill operations pro¬ 
vided by the multiported record buffer and instruction buff¬ 
ers. Together they allow the VLSI data filter to perform 
synchronous data search faster than a conventional micro¬ 
processor used in the same technology. 

We made the data filter more efficient by integrating sev¬ 


eral subsystems on a single VLSI chip. The experimental VLSI 
data filter includes additional subsystems beyond the micro¬ 
processor itself, including a tri-port RAM for the record buffer, 
dual-port RAM for the instruction buffer, and extensive glue 
logic to support synchronization, instruction sequencing, and 
handshaking. Implemented off chip, these functions would 
require expensive high-speed components to perform as well 
as the integrated design. We estimate that it would take four 
times as much hardware to replicate the data filter and its 
support functionality compared to our prototype design. The 
improvements in execution efficiency, combined with the 
factor-of-four reduction in hardware, provide a basis for
advocating the use of a VLSI accelerator over an approach 
based on a general-purpose microprocessor. 

Designers can achieve the performance advantage with 
only a modest effort by taking advantage of high-level speci¬ 
fication languages and VLSI tools. Matoba et al. 12 describe a 
rapid-turnaround methodology that we used to realize our 
data filter prototype in nine staff-months. The ability to 
design quickly a VLSI system as complex as the data filter 
should allow custom VLSI to be commercially viable in many 
more application systems than it has been. The ability to 
explore options rapidly with precision should allow systems 
research to progress rapidly beyond the paper design stage 
where ideas often stall for lack of expertise and/or human 
resources. 

The VLSI research prototype fabricated in 2-µm CMOS technology
executes more than 16 million predicate evaluation
instructions per second and achieves 64-Mbyte/s search 
throughput for individual queries or sets of queries compris¬ 
ing a total of up to 30 Boolean predicates. Table 1 summarizes 
some chip parameters for our first prototype. The chip con¬ 
sists of 91,000 transistors and the area is 475 mils square. The 
chip’s speed is determined by the longest stage in the pipeline 
(54 ns, determined by the speed of the ALU cell). The first 
wafer lot produced 33 working chips, a yield of 16 percent. 
The prototype filter is an integral part of a working experi¬ 
mental prototype database system that efficiently supports a 
rich set of query types plus full transaction semantics. 17 


Table 1. Experimental chip parameters.

Parameter                Value
Number of transistors    91,000
Chip size                476 × 477 mil²
Number of pins           171
Cycle time               54 ns
Power at 18.5 MHz        2.5 W
Supply voltage           5 V
Process technology       2-μm CMOS

Unformatted string search accelerator

String searching is another operation that requires extensive
scanning of unformatted data in database applications. The problem of
string searching is to find all occurrences of a p-character pattern P
constructed from a vocabulary of m distinct characters in an
s-character data string S. The pattern P may also contain don’t care
characters. For typical applications, p ≪ s and m ≪ s. Since the data
string is usually very large, sequential search via general-purpose
processors is prohibitively slow. Thus, reducing the time required to
search a large text database has been an active area of research for
the past decade.
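As a point of reference, the sequential search that the accelerator is meant to replace can be sketched in a few lines of Python. This is only an illustrative baseline, not code from the article: the function name `naive_search` and the use of `?` as the don't care symbol are assumptions made for the example.

```python
def naive_search(data, pattern, dont_care="?"):
    """Return the 0-based start offset of every occurrence of pattern in data.

    A pattern character equal to `dont_care` (a hypothetical marker chosen
    for this sketch) matches any data character.
    """
    p, s = len(pattern), len(data)
    hits = []
    for start in range(s - p + 1):          # try every alignment in turn
        if all(pc == dont_care or pc == data[start + k]
               for k, pc in enumerate(pattern)):
            hits.append(start)
    return hits
```

Each alignment is tested one character at a time, so the work grows with the product of the data length and the pattern length in the worst case, which is the cost a parallel comparator array is intended to avoid.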

We propose a new parallel VLSI algorithm called Data Parallel Pattern
Matching (DPPM) and a corresponding VLSI document-search engine called
DPPME. The DPPM algorithm differs from most previous work in that it
serially broadcasts each character in the pattern and compares the
pattern to a block of data in parallel. The DPPM algorithm uses the
high degree of integration of VLSI technology to process quickly
through parallelism. Based on simulation statistics and timing
analyses of the hardware design, we can achieve a search rate of
multiple Gbytes/s with DPPME using advanced VLSI technology.

The DPPM algorithm. Let S[1:n] be the data string of n characters to
be searched and Pat[1:p] be the pattern of p characters. The data
string is divided into blocks of b characters each and searched a
block at a time. Let Blk[1:b] be the current data block of size b
characters. Basically, the DPPM algorithm serially broadcasts each
pattern character to a block of the data string. If the pattern
character matches any of the characters in the block, the next pattern
character is broadcast in the next comparison cycle. If, at any cycle,
no match is found between the current pattern character and the data
block, and if no partial match is carried over from the previous
block, the filter discards the data block. The search continues with
the next block. A partial match occurs when Pat[i], i < p, matches the
last data block character Blk[b]. This partial match information is
stored and used in the next block to continue the search by comparing
Pat[i+1] to the first data block character Blk[1].

Figure 3. DPPM example. (Labels in the figure: early out; partial
match; no match, jump to c; hit.)

We can best illustrate the DPPM algorithm with a simple example.
Suppose we conduct a search for the pattern “abcd” in the data string
“abadbbabcdee.” Figure 3 shows the operation of the DPPM algorithm
using a block size of four.

The first block, containing the characters “abad,” is first loaded
into the comparator array. When compared to the first pattern
character “a,” two matches are detected. The second pattern character
“b” is then broadcast and compared to the characters to the right of
the matched characters in the previous cycle, that is, “b” is compared
to the second and the fourth characters in the block. Since a match is
detected again at the second character, a comparison of the third
pattern character “c” with the third character in the block is
necessary. This time, no match in the block is observed; thus, the
current block is discarded and the search continues with the next
block. This early mismatch detection mechanism avoids broadcasting and
comparing the fourth pattern character “d” to the current block since
this comparison is unnecessary.

The next block contains the characters “bbab.” The pattern compares
successfully up to the second character “b.” At this point, the end of
the block is reached. The pattern may span the block boundary. DPPM
acknowledges this partial match information and continues the match in
the next cycle. Since no other match occurs in this block, the current
block is also discarded.

The third block contains the characters “cdee.” The first pattern
character “a” has no match with the block. At this point, DPPM
acknowledges a partial match in the previous block up to the second
pattern character. Therefore, it jumps to the third pattern character
“c” and continues the partial match from the previous block. Finally,
a hit, or an occurrence of the pattern, is detected with the fourth
pattern character “d.”

Although not shown in this example, the DPPM algorithm 
can also detect multiple occurrences of the pattern even if 
they overlap, and no backtracking is required to detect all 
occurrences. 

Figure 4 shows the pseudocode of the control flow of the DPPM
algorithm. DC[1:p] is a bit vector indicating the don’t care positions
in the pattern. DC[i] is set if there is a don’t care at Pat[i].
Mask[1:b] controls the activation of the comparator array based on the
results of the previous comparison cycle. If Pat[1] matches Blk[i] in
the first comparison cycle, then Mask[i+1] is set in the next
comparison cycle, enabling the comparison between Pat[2] and Blk[i+1].
T[1:b] holds the results of the comparator array. Vin[2:p] and
Vout[1:p-1] hold the partial match information from the previous block
and the current block, respectively. Vin[i] is set if a partial match
is found in the previous block up to and including the character
Pat[i-1]. Vout[i] is set if a partial match is found up to and
including the character Pat[i] in the current block.

With the use of Vin and Vout, a partial match can be continued in the
next block without any backtracking of the data string. Since the
algorithm is independent of the block size, the search rate can be
increased simply by increasing the block size, that is, the number of
comparators. As a result, the DPPM algorithm can efficiently use the
high degree of integration of VLSI technology to attain high-speed
processing through parallelism.

    while (not End of Data String) do begin
        GetNextBlock(Blk);
        forall i from 1 to b do Mask[i] := TRUE;
        forall i from 1 to (p-1) do Vout[i] := FALSE;
        i := 1;
        while i ≤ p do begin          /* Comparison Cycle */
            forall j from 1 to b do
                T[j] := Mask[j] ∧ (DC[i] ∨ (Blk[j] = Pat[i]));
            if (i < p) then Vout[i] := T[b]
            else begin                /* i = p: Report hit found */
                forall j from 1 to b do
                    if (T[j]) then Report Hit at j;
                break;                /* Match(es) in block */
            end;
            if (∨ over j = 1..b of T[j]) then begin
                /* Prepare Mask for next comparison cycle */
                forall j from 2 to b do
                    Mask[j] := T[j-1];
                Mask[1] := Vin[i+1];
                i := i + 1;
            end
            else if (∨ over j = i+1..p of Vin[j]) then begin
                /* Continue partial match */
                /* Skip Unnecessary Pattern Characters */
                do i := i + 1 until (Vin[i]);
                Mask[1] := TRUE;
                forall j from 2 to b do Mask[j] := FALSE;
            end
            else                      /* Early Out */
                break;
        end;
        forall i from 2 to p do Vin[i] := Vout[i-1];
    end;

Figure 4. Pseudocode of the DPPM algorithm.
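The control flow of Figure 4 can be transcribed into a short software model to check the worked example of Figure 3. The following Python sketch is an illustration only, not the authors' implementation: the function name `dppm_search`, the 1-based hit positions, and the handling of a short final block are assumptions of this sketch, and the parallel `forall` steps of the hardware are emulated with ordinary list operations.

```python
def dppm_search(data, pattern, block_size, dont_care=None):
    """Software model of the DPPM control flow in Figure 4.

    Returns the 1-based position in `data` at which each occurrence of
    `pattern` ends. `dont_care` is an optional list of booleans (DC).
    """
    p = len(pattern)
    dc = dont_care or [False] * p
    vin = [False] * (p + 2)            # vin[i], i in 2..p; other slots unused
    hits = []
    for start in range(0, len(data), block_size):
        blk = data[start:start + block_size]
        n = len(blk)                   # the final block may be short
        mask = [True] * n
        vout = [False] * (p + 1)       # vout[i], i in 1..p-1
        i = 1
        while i <= p:                  # one comparison cycle per pass
            t = [mask[j] and (dc[i - 1] or blk[j] == pattern[i - 1])
                 for j in range(n)]
            if i < p:
                vout[i] = t[n - 1]     # trace reached the block boundary
            else:                      # last pattern character: report hits
                hits.extend(start + j + 1 for j in range(n) if t[j])
                break
            if any(t):                 # some trace is still alive
                mask = [vin[i + 1]] + t[:-1]
                i += 1
            elif any(vin[i + 1:p + 1]):  # resume a carried-over partial match
                i += 1
                while not vin[i]:
                    i += 1
                mask = [True] + [False] * (n - 1)
            else:                      # early out
                break
        vin = [False, False] + vout[1:p] + [False]
    return hits


print(dppm_search("abadbbabcdee", "abcd", 4))  # [10]
```

With the example of Figure 3, the single hit is reported in the third block at the second character, which is position 10 of the data string. Because Vin carries the boundary state from block to block, the result does not depend on the block size chosen.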

VLSI prototype design. Figure 5 shows the circuit block diagram of
the experimental DPPM engine. Before the actual search operation, the
pattern, pattern length, and don’t care (DC) positions are first
loaded into their corresponding registers. The data string is buffered
and read one block at a time into the Block register. The comparator
array performs the actual comparison between the pattern character and
the data block. The results of the comparator array are ANDed with the
mask to form register T. The DPPM engine integrates the mismatch
detection and the partial match propagation mechanisms by combining
the Vin register (partial match information from the previous block)
with the old T register (match results of the previous comparison
cycle) to form the mask for each comparison cycle. The first bit of
the mask is from Vin[i], and the last (b-1) bits are from the first
(b-1) bits of T in the previous cycle. The last bit of T is stored
into Vout[i]. Each time a new data block is read, the first (p-1) bits
of Vout are loaded into the last (p-1) bits of Vin. The first bit of
Vin is always set.

The sequence controller controls the operation of the DPPM 
engine by generating the value i, which is used to index the 
pattern, don’t care, Vin, and Vout registers. By monitoring the 
values of T, the content of the Vin register, and the pattern 
length, the sequence controller decides for each cycle one of 
the following three actions: 

• Compare the next pattern character with the current 
block, 

• Jump to a pattern character to continue the partial match 
from the previous block, or 

• Discard the current block and continue the search 
with the next block. 

The first action is taken if this is not the last pattern character
and the content of T, that is, the number of matches detected in the
current cycle, is nonzero. If T is zero, the second action applies:
the index of the next pattern character to be used for comparison is
determined by a priority encoder that encodes the first nonzero bit in
the Vin register after masking off the first (i-1) bits of the Vin
register using a linear shift register. The third action is taken if
the last pattern character is reached, or if T is zero and there is no
more partial match propagation from the previous block (early out).

The sequence controller also generates a last character signal when
the last pattern character is reached. This signal is used by the hit
detection unit, which checks the values of T to report any hit in the
search. The priority encoder produces the encoded addresses for all
hit positions in the current block.

The critical path of the circuit is from the pattern register,
through the comparator and AND arrays, to the priority encoder. Using
2-μm CMOS technology, we achieve a comparison cycle time of about 50
ns for a block size of 1,000. This cycle time includes the
broadcasting delay of 6 ns. If 1.2-μm CMOS is used, the cycle time
will be approximately 33 ns. The chip area of the DPPM engine for a
block size of 128 is roughly 200 × 100 mil².

Figure 5. Circuit block diagram of the experimental DPPM engine.

It is important to understand whether the I/O bandwidth of a CMOS
chip can sustain the Gbyte/s search rate of the DPPM engine. Marcus 21
has demonstrated that using 32 input pads, a bandwidth of 5.44
Gbytes/s can be achieved with 1.2-μm CMOS technology. We can achieve
even higher bandwidth by using advanced packaging technology that
supports higher I/O pin counts. 22

Performance evaluation. We can evaluate the DPPM algorithm by
simulating its operation on a real text database. Under such an
environment, we can estimate its performance and also learn about its
behavior. We chose as a database the Associated Press wire news
articles of August 2, 1988, a 4.4-Mbyte database. For test patterns,
we chose the 100 most frequently occurring words in American English
that are at least five characters long. The pattern lengths vary from
5 to 10 characters with an average of 5.88 characters. 23

Figure 6 shows the average number of comparison cycles per block, C,
measured at different block sizes. Although the pattern characters are
serially compared to the data block, early mismatch detection allows
the algorithm to search the next block as soon as a mismatch is
detected. This feature is especially effective at smaller block sizes,
where the probabilities of matching the first few characters of the
pattern are low. Without early mismatch detection, C equals the
average pattern length, in this case 5.88. At larger block sizes, the
probability of finding the pattern in the block is higher; thus the
value of C also increases. As the block size increases, the value of C
approaches the average pattern length asymptotically.

Figure 6. Average number of comparison cycles per block.

Using 50 ns as the comparison cycle time T, Figure 7 shows the search
rates R_b at different block sizes. Recall that increasing the block
size requires proportionately more comparators on a chip. At a block
size of 16, the search rate is 212 Mbytes/s. This rate matches the
predicted optical disk transfer rate of 200 Mbytes/s. 13 At a block
size of 128, the search rate reaches 1 Gbyte/s. This rate is
sufficient to handle existing memory bandwidth of supercomputers as
well as data input from optical-fiber transmission systems in future
communication networks.

Figure 7. Search rates at different block sizes.
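The search rates quoted above can be related to the average cycle count C with a little arithmetic. The sketch below assumes, as the text implies but does not state as a formula, that a block of b one-byte characters is consumed every C comparison cycles of duration T, so that R_b = b / (C × T). The function names are hypothetical, and the cycle counts it returns are back-calculated from the quoted rates rather than read from Figure 6.

```python
def search_rate(block_size, cycles_per_block, cycle_time_s):
    """Search rate in bytes/s, assuming R_b = b / (C * T) and 1 byte per character."""
    return block_size / (cycles_per_block * cycle_time_s)


def implied_cycles(block_size, rate_bytes_per_s, cycle_time_s):
    """Average cycles per block C implied by a quoted search rate."""
    return block_size / (rate_bytes_per_s * cycle_time_s)


T = 50e-9  # comparison cycle time from the text, in seconds
print(round(implied_cycles(16, 212e6, T), 2))   # about 1.51 cycles per block
print(round(implied_cycles(128, 1e9, T), 2))    # 2.56 cycles per block
```

Under this assumption, both quoted rates correspond to far fewer cycles per block than the 5.88 that would be needed without early mismatch detection.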

Extension and integration. Instead of comparing the pattern
characters serially to the data block as in the DPPM algorithm, it is
possible to compare multiple pattern characters to the data block. The
Multiple-Pattern, Multiple-Data (MPMD) approach reduces the number of
comparison cycles required per block and thus results in a higher
search rate. Since multiple pattern characters can be grouped into a
word of suitable length (for example, four characters long for a
32-bit processor), the MPMD engine can be easily integrated with the
relational data filter previously described so that the search
operation can be performed on both formatted and unformatted data.

An MPMD engine uses a 2D array of comparators to perform multiple
steps of the DPPM algorithm in parallel. Pattern characters can be
loaded and compared in parallel to the input data block. The partial
match traces propagate through the comparators in a diagonal direction
and terminate at the V registers for partial matches and T registers
for matches. The whole parallel comparison operation can be
implemented in one or more pipeline stages. Notice that the maximum
length of the critical path equals the smaller of the input block
length and the pattern length.

To integrate the MPMD engine in the experimental relational data
filter, the designer must provide interfaces to pass the control
information between the MPMD and the data filter instruction set. A
Match flag can be used to indicate one or more matches terminated at
the current input data block (indicated in the T register). This Match
flag can be used as a condition code for Boolean predicate evaluation
instructions. Furthermore, since all the state information of the
string search operation is stored in the V and T registers, these
registers can be mapped as a part of the register file. Since the
string search operation is performed while the data is loaded into the
relational data filter, the search results for each input record can
be stored in flags and registers. The text is also segmented into
128-byte records. The results are then forwarded to the relational
filtering operation in pipelined fashion. The relational data filter
can then use and manipulate directly the search results stored in the
condition codes and in the general-purpose registers.

The MPMD circuit is fast enough to be implemented in a single
pipeline stage that matches the VLSI chip I/O rate. The critical path
of the MPMD circuit for an input data block 4 bytes wide (8 bits per
byte) consists of a byte comparator plus four stages of combinational
logic for partial match propagation. Using a conservative VLSI
technology with about 300-ps gate delay time, the byte comparison will
take about 5 ns (including pattern and data broadcast time, XOR
comparison logic, and eight carry propagation stages). The partial
match signal propagating through four stages (4-byte comparator in
diagonal) will take about 1.2 ns. Thus, the total time of each
parallel comparison will be less than 8 ns. Since this comparison
speed matches the maximum data input rate of high-performance I/O
pads, 21 it is unnecessary to introduce extra pipeline stages for the
comparator array.

Figure 8. Accelerator and processor integration.
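The MPMD timing budget described above can be checked with simple arithmetic. The sketch below assumes one 300-ps gate delay per partial match propagation stage, which is consistent with the 1.2 ns quoted for four stages; the function names are hypothetical and the 8-ns figure is used as the cycle budget.

```python
def critical_path_ns(byte_compare_ns, stages, gate_delay_ns):
    """Byte comparison time plus one gate delay per propagation stage."""
    return byte_compare_ns + stages * gate_delay_ns


def input_rate_bytes_per_s(block_bytes, cycle_ns):
    """Data rate when one block is consumed every cycle."""
    return block_bytes / (cycle_ns * 1e-9)


path = critical_path_ns(5.0, 4, 0.3)        # about 6.2 ns, under the 8-ns budget
rate = input_rate_bytes_per_s(4, 8.0)       # about 5e8 bytes/s
print(round(path, 1), round(rate / 1e6))
```

One 4-byte block every 8 ns corresponds to about 500 Mbytes/s, which is of the same order as the input data rate the article cites for the multi-pattern accelerator.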

For concurrent query evaluation, we can construct multiple parallel
MPMD building blocks that allow multiple text patterns to be searched
simultaneously. To do that, multiple Match flags can be indexed for
further Boolean predicate evaluation. With current VLSI technology, we
estimate that thirty-two 8-byte-long text patterns can be compared in
parallel (about 1,000 byte comparators in total) in one VLSI chip.
This estimation is conservative; however, a small VLSI chip may
significantly increase the yield and reduce the cost. Such an
accelerator can search 32 text patterns concurrently at about
500-Mbyte/s input data rates.

VLSI accelerators can achieve very fast string searches, at rates
easily exceeding the I/O data rate of VLSI chips. The density of
current VLSI technology can further improve the throughput of
concurrent string searches. The text string matching operations can
operate synchronously with the Boolean predicate evaluation operations
used in data scan operations.

Application notes 

The proposed experimental VLSI accelerators can enhance the
performance of general-purpose computing and dedicated systems.

General-purpose computer systems. We use an abstract data processing
model to illustrate the applications of the experimental VLSI
accelerators for general-purpose computing systems. The abstract model
consists of a set of storage devices and a set of VLSI search
accelerators interconnected by a high-performance bus interconnection
network. This network provides fast virtual circuit connections for
communication sessions required by the applications. As shown in
Figure 8, designers can easily incorporate the accelerators into a
board-level processing node. A typical processing node consists of a
CPU with high-speed caches as well as a large dual-ported main memory.
The accelerators process data on a page-by-page basis. The data comes
through the bus from either the main memory or other nodes. While the
data filter is executing the search and aggregation operations, the
CPU can execute other tasks. When the search and aggregation
operations are completed, the search results can be accessed by the
CPU through the main memory.

For a parallel computer, multiple processing nodes are connected by
an interconnection network (the design of which is beyond the scope of
this article). A search operation starts from the loading of search
instructions into one or more data filters. Communication channels
between the data sets and the accelerators are established to form a
filter network. After the filter network is formed, the data passes
through the network in a pipelined fashion. The input and output of
each data filter are buffered synchronously or asynchronously,
depending on the applications. Figure 9 shows a tree-structured
associative search task designed to select data records satisfying the
predicate ((C and A) or (C and B)), where A, B, C are patterns. The
first two accelerators search for two disjunctive patterns A and B,
respectively, from two data sets. The third accelerator searches for
pattern C from records selected by the first two filters.
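The tree-structured task just described can be mimicked in software to make the data flow concrete. The following Python sketch is purely illustrative: each "filter" is a generator stage that uses a substring test as a stand-in for an accelerator, and the names `make_filter` and `filter_network` as well as the sample records are hypothetical.

```python
from itertools import chain


def make_filter(pattern):
    """Return a pipeline stage that passes only records containing pattern."""
    def stage(records):
        return (r for r in records if pattern in r)
    return stage


def filter_network(data_set_1, data_set_2, a, b, c):
    """Select records satisfying ((C and A) or (C and B)).

    Two first-level stages search for A and B in separate data sets;
    a third stage searches for C in the merged stream of survivors.
    """
    filter_a, filter_b, filter_c = make_filter(a), make_filter(b), make_filter(c)
    return filter_c(chain(filter_a(data_set_1), filter_b(data_set_2)))


selected = list(filter_network(
    ["fire in Texas", "fire in Ohio"],
    ["flood in Texas", "drought in Texas"],
    "fire", "flood", "Texas"))
print(selected)  # ['fire in Texas', 'flood in Texas']
```

Because the stages are generators, records stream through the network one at a time, loosely mirroring the pipelined operation described in the text.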

Figure 9. A filter network.

Dedicated systems. The experimental VLSI data filter is a critical
component in a large-scale broadcasting database system prototype
called the Datacycle architecture. 17 The Datacycle architecture
demonstrates the feasibility of providing high-performance and highly
flexible database access services for a large number of users in
geographically distributed areas. Datacycle consists of a set of data
and data filters located in multiple distributed sites. The system
broadcasts data to data filters cyclically through optical fibers. (To
support applications with a very large volume of data, only the part
of the data that is constantly changing or frequently accessed by
multiple sites is broadcast repetitively.) Search queries are
translated into data filter instructions and distributed to multiple
data filters that execute them concurrently.

Since the data is made available to all the data filters
simultaneously, the system can concurrently process an unlimited
number of queries. This broadcast property is fundamental in
supporting concurrent associative search, aggregation, and transaction
processing services for millions of users on a set of constantly
changing data.

To demonstrate its implementation feasibility, we installed a
prototype system of the Datacycle architecture in our laboratory. It
consists of silicon memory-based data storage devices, 500-Mbps
fiber-optic broadcasting links, and multiple formatted data filter
boards. The architecture of a distributed site containing data sets is
similar to the general-purpose node architecture shown in Figure 8. A
distributed site that performs only data accessing contains three VLSI
data filters and associated controllers. The prototype system
demonstrated the functionality of the Datacycle architecture. An
extended query set can be programmed with the data filter
instructions. These extended queries execute in two phases. First, the
data filter prefilters to reduce the amount of data to be processed
later. Second, the general-purpose CPU completes the query processing
and retrieves the data.

We learned several lessons from the Datacycle prototype. First, the
effort involved in the VLSI accelerator design is very small compared
to the total effort involved in software development and system
integration (less than 20 percent). This is partly due to the
advancement in CAD tools but mostly due to the increasing complexity
of system and application software. Second, the use of VLSI
accelerators significantly improves the performance of the overall
system and enables new capabilities more economically.

Frieder, Lee, and Mak proposed a document-searching architecture,
based on the DPPM engines, to increase the throughput of an
information retrieval system. 23,24 The proposed architecture (Figure
10) comprises a set of DPPM engines and a single master processing
element (PE). An information retrieval query is decomposed by the PE
into basic match primitives to be executed in the DPPM engines, while
the query (a sequence of operators) is evaluated at the PE using
results from the document-search engines. By separating the operator
and query complexity from the DPPM engines, the complexity of the
customized hardware is freed from the complexity of the query.

Figure 10. Proposed document-searching architecture.

The PE instruction set is based on the text retrieval machine
instruction set presented by Hollaar et al. 25 but modified to be
match-based. In other words, each instruction is defined as a set of
match conditions with a simple imposed control structure. The
instruction set 23 supports Boolean operations of patterns as well as
wild-card characters that match one or more characters in the pattern.
As the imposed control structure requires minimal computation time
relative to the amount of processing involved in searching the
document stream, it can be implemented in software without
significantly affecting the system throughput. A 50-MIPS PE can
support 10 DPPM engines of 128 blocks in parallel, at a utilization
level of 0.25, with each DPPM engine having a search rate of 1
Gbyte/s. At this search rate, we can search both the Old and the New
Testaments of the Bible in 5 ms, Webster’s Dictionary in 16 ms, and
the entire volume of Encyclopaedia Britannica in 400 ms.
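The scan times quoted above follow from dividing a text's size by the search rate. The corpus sizes used in the sketch below (about 5, 16, and 400 Mbytes) are not stated in the article; they are back-calculated from the quoted times at 1 Gbyte/s and should be read as assumptions, as should the function name `scan_time_ms`.

```python
def scan_time_ms(size_bytes, rate_bytes_per_s=1e9):
    """Time in milliseconds to stream size_bytes past an engine at the given rate."""
    return size_bytes / rate_bytes_per_s * 1e3


# Sizes implied by the quoted times (assumed, not taken from the article).
for name, size in [("Bible", 5e6), ("dictionary", 16e6), ("encyclopedia", 400e6)]:
    print(name, round(scan_time_ms(size), 1), "ms")
```

The same function can be used to estimate the scan time of any other corpus once its size and the sustained search rate are known.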

Related work 

Numerous hardware-based solutions have been proposed and implemented
to expedite both the associative and string search operations. As fast
software string-searching algorithms 26,27 are based on finite-state
automata (FSA), hardware realizations of FSA pattern matching were
investigated by several researchers. 28-30 FSA requires precompilation
of the patterns and processes the data string one character at a time.
Although precompilation of the pattern eliminates the need to compare
each character of the data string to every pattern character, the
sequential character-at-a-time processing severely limits the search
rates of these systems.

Several researchers 31-34 propose using comparator arrays to perform
pipelined pattern matching directly without precompilation of the
patterns. Multiple patterns are compared concurrently to the data
string to achieve higher throughput. However, the search rate is still
limited by the sequential processing of the data string. In the
systolic array approach, 31 data and pattern characters are routed in
opposite directions. At any given clock cycle, only half of the cells
in the array can perform meaningful computation. Therefore, half of
the physical hardware is actually wasted. In some proposals, 32-34
pattern characters are preloaded into the comparator cells. Each
character of the data string is broadcast into all cells serially, and
comparison results are generated by all cells simultaneously. Since a
string search operation on a text database exhibits very low
selectivity, comparisons beyond the first few characters of the
pattern are usually unnecessary. Thus, most of the comparisons with
the last few characters of the pattern are redundant.

To reduce the number of redundant comparisons and to increase the
degree of effective parallelism in the pattern matching problem, both
Altep 35 and our string-search algorithm use a data parallel, pattern
serial scheme in which pattern characters are broadcast and compared
to a block of the data string in parallel. While Altep is a cellular
processor optimized for regular expression comparisons with
microprogrammed control, our DPPM engine is a VLSI filter optimized
for variable-length text processing with hard-wired control. The
decoupling of query resolution from the primitive match operation
simplifies the structure of the DPPM engine so that it can be
implemented compactly and is hence more efficient.

Relational data filtering is different from unformatted string
searching in that in the former, the starting position of the pattern
to be matched is specified in the record format definition. As a
result, associative search on relational data requires fewer
comparison operations than string search. Another difference is that
relational data filtering requires more complex search predicates than
does unformatted string searching. Therefore, the filters designed for
string search are not efficient for relational data filtering.

The proposed relational data filter differs from other filters
28,29,36-39 in that it is a RISC architecture with a highly
programmable instruction set and efficient implementation. As
indicated earlier, the instruction set of the data filter can
construct a wide mixture of associative search and aggregation queries
based on efficient Boolean predicate and conditional assignment
operations. The simplicity of the reduced instructions also results in
fast VLSI implementation.

The proposed database accelerator can offload the CPU from the time-
and space-consuming associative search and aggregation operations. The
novel instruction set and Boolean accumulator support allow highly
pipelined Boolean predicate evaluation for on-the-fly processing over
high-speed input data streams available in future storage technology.
The efficient VLSI architecture results in a 64-Mbyte/s associative
search rate via a simple design with modest fabrication technology.
The architecture can be easily extended to support a higher data rate
by increasing the width of the data path and the number of pipeline
stages. The VLSI data filter achieves significant performance
advantages for synchronous searches of formatted data when compared
with general-purpose microprocessors. Its efficiency increases when we
integrate several subsystems onto a single VLSI chip.

The VLSI DPPM engine overcomes the fundamental limitation of
sequential pattern-matching algorithms and efficiently processes
consecutive text strings in blocks of variable size. We can achieve a
search rate of Gbytes/s with a simple VLSI implementation of the
algorithm. By incorporating the DPPM engine with high-speed data
channels available in silicon memory and optical storage devices, we
can economically resolve the problem of slow response time for large
information service databases.

Through the actual design and the use of VLSI accelerators, we
learned that VLSI accelerators are a very cost-effective means for
improving overall system performance. Modern VLSI design tools can
significantly reduce the time and the cost believed to be required for
custom IC design. As a consequence, the VLSI accelerator approach is a
relatively low-risk and inexpensive one when the overall software and
hardware system design and integration costs are considered.
Additionally, we learned that the value of VLSI accelerator design is
not limited to the system performance gain. Basic functional
components and I/O structures abstracted from the accelerator designs
can potentially become standard features for future general-purpose
CPU architecture designs.

An Associative Accelerator for Large Databases


Pascal Faudemay 
Mongia Mhiri 

Laboratoire MASI 


The RAPID-1 (Relational Access Processor for Intelligent Data) is an
associative accelerator that recognizes tuples and logical formulas.
It evaluates logical formulas instantiated by the current tuple, or
record, and operates on whole relations or on hashing buckets. RAPID-1
uses a reduced instruction set and hardwired control and executes all
comparisons in a bit-parallel mode. It speeds up the database by a
significant factor and will adapt to future generations of
microprocessors.


Information servers should perform at a higher level to be usable for
more numerous applications. Among such applications are real-time
information systems; large knowledge bases; and very large databases,
such as those that hold seismic, textual, and financial data. A
solution to the performance problem may come from better algorithms,
more parallel systems, or more calculating power in the processor. In
any case, good processing power is desirable and may reduce the
complexity of parallel machines or help to improve performance on
sequential machines.

To increase the power of general processors, 
we can rely on processors themselves, or on ap¬ 
plication-specific integrated circuit (ASIC) accel¬ 
erators. The old reasoning that specific circuits, 
which were not yet ASICs, cannot perform as 
well as general circuits contradicts all research 
on ASIC and ASIC-dedicated CAD. 1 ' 3 Today’s ASIC 
circuits are designed at the schema level, using 
standard cells, and the layout can be shrunk or 
changed automatically, according to the technol¬ 
ogy. Specific architectures have been developed 
for information servers, 4,5 and most of them use 
specific circuits. Recent machines deliver very high 
performance. 6 Among the various accelerator 
types, 6-11 associative circuits 12-14 seem to offer a
natural solution for extended relational servers 
(see box on information servers) and for knowl¬ 
edge base servers. 

Information servers retrieve and update data, 


based on assertions of the data to be retrieved. 
Associative circuits also retrieve in parallel a set 
of data, based on a description of its content. 
Therefore, associative circuits seem well-adapted 
to the operations of extended relational servers 
and knowledge bases, as well as set-oriented op¬ 
erations on object-oriented information servers. 

Associative circuits use Cartesian product 
algorithms, in which operations on sets of tuples 
(relations) are done by considering all pairs of 
tuples. Each cell or processing element of the 
associative circuit compares a tuple, or part of a 
tuple, which is memorized in this cell with the 
comparand tuple. (See box on associative circuits.) 
It is more efficient to join in parallel two relations 
by Cartesian product than to sort-join in parallel 
when the number of processing elements is a 
significant fraction of the number of tuples of 
one of the relations. 15 

The best way to satisfy this condition is to 
divide the operations into smaller ones compat¬ 
ible with the size of the associative circuits. The 
most efficient solutions are hash-Cartesian prod¬ 
uct algorithms. Comparisons between tuples 
belonging to two source relations are limited to 
comparisons of hashing buckets having the same 
index. 16 Therefore, the desirable number of pro¬ 
cessors is limited to the number of tuples in the 
hashing bucket of the smaller relation, multiplied 
by the number of processing elements used for 
each tuple (usually one PE per tuple). 
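The hash-Cartesian product idea can be sketched in software as follows. This is a minimal Python illustration under stated assumptions, not the RAPID implementation: the function names `hash_partition` and `hash_product_join`, the bucket count, and the use of Python's built-in `hash` are all chosen for the example. The join predicate is assumed to include an equality on the hashed attributes, since otherwise matching tuples could fall into buckets with different indexes.

```python
from collections import defaultdict

def hash_partition(relation, key, n_buckets):
    """Split a relation (list of dicts) into buckets by hashing one attribute."""
    buckets = defaultdict(list)
    for tup in relation:
        buckets[hash(tup[key]) % n_buckets].append(tup)
    return buckets

def hash_product_join(r, s, r_key, s_key, predicate, n_buckets=4):
    """Join r and s by comparing all pairs, but only inside same-index buckets."""
    r_buckets = hash_partition(r, r_key, n_buckets)
    s_buckets = hash_partition(s, s_key, n_buckets)
    result = []
    for index, r_bucket in r_buckets.items():
        # Cartesian product restricted to the two buckets with the same index.
        for r_tup in r_bucket:
            for s_tup in s_buckets.get(index, []):
                if predicate(r_tup, s_tup):
                    result.append((r_tup, s_tup))
    return result
```

In this sketch the inner double loop is the part an associative circuit would execute in parallel, with one bucket stored in the processing elements and the tuples of the other bucket broadcast one by one.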

Hash-product solutions can be implemented 
Information servers


Information servers manage databases in which elemen¬ 
tary data may belong to any type and knowledge bases 
(databases of rules) for the benefit of users and applica¬ 
tions distributed on an interconnection network. Database 
systems with the best performances for the most-used func¬ 
tions are extended relational systems. These systems rely 
on the relational data model, which defines unary and 
binary operations on relations. Relations are subsets of Car¬ 
tesian products of domains, such as integers or real num¬ 
bers. An element of a relation is a tuple, which can be 
represented as a record. Tuples are composed of attributes 
(possibly implemented as fields in a record) that take their 
value in the domains which define the relation. 

The operations of the relational model are 

• SELECTION, which returns the tuples that satisfy a logical
formula;

• PROJECTION, which eliminates duplicate tuples result¬ 
ing from the suppression of some attributes; 

• JOIN, which is a selection on the Cartesian product of 
two relations; and 

• SET OPERATIONS on two relations (union, intersection, 
and so on.) 

Relational languages also include sorting, aggregates with 
or without grouped-by (for example, the average of a salary
grouped by region), and sometimes the transitive closure
on a relation. The RAPID circuit implements all these
operations, though the transitive closure is not part of its 
instruction set. 

Extended relational models define their attributes on 
domains corresponding to abstract data types defined by 
the user. They possibly can represent polygons or images. 
Tuples and attributes may be of any size up to several 
Mbytes. RAPID processes abstract data types in a simple 
way. 

Other data models that are presently developed are 
object-oriented models in which methods are defined for 
each type of persistent or nonpersistent object and used 
for accessing or updating them. These object-oriented data 
models usually rely on object graphs in which links are 
represented by constant identifiers, not by pointers. These 
models are less advanced in terms of performance, there¬ 
fore less adapted to the implementation of such servers as 
real-time information servers. However, they are used for 
an internal representation of objects in extended relational 
systems, which can use object managers for access and 
updates to objects. The software kernel of the RAPID 
machine uses the services of an object manager. On the 
other hand, object graphs can always be represented 
through the relation super-type, especially to execute 
operations on these graphs in a set-oriented manner. 


by using a software hashing before the Cartesian product or 
by hashing the stored tuples and tuples from the second 
source relation by hardware at storage time and evaluation 
time, as it is done in some set-associative memories. In this 
case, however, one of the source relations must more or less 
remain in the associative circuit, even if the number of pro¬ 
cessing elements is smaller than the number of tuples. 17,18 

We chose to use algorithms based on the hash-product, 
with a software implementation of hashing. Below, we 
define our circuit design goals, the data structures that the 
circuit recognizes, the instruction set and associated meth¬ 
ods, and the architecture and performances. Finally, we show 
how this type of circuit may correspond to the needs of sev¬ 
eral generations of microprocessors. 

Main issues 

Today’s associative circuits do not seem fully adapted to 
general database queries. They are either too specific or too 
slow. In addition, a main issue in the design of an associative 
circuit for databases is whether to use a small associative 
accelerator, which can be added to any machine, or imple- 


Associative circuits 

Associative circuits execute retrievals and updates on 
data based on 

• their content, 

• an assertion on this content, and, most often, 

• on a comparand, or key data, which is compared 
to data stored into the circuit, and used to instanti¬ 
ate the assertion. 

They are mainly composed of an array of associative 
cells, or processing elements, that execute comparisons 
in parallel, and of one or several functional blocks that 
use the comparison results in order to solve the query. 
These circuits can also execute other operations, such 
as sorting, proximity evaluations on the comparand and 
data (for example, Hamming distance), or other 
operations. 
Associative accelerator


Associative memory capacity 

Associative memories may be fully associative memo¬ 
ries or set-associative memories. In the latter, data are 
distributed between sets by a hashing function, and a 
comparand compares to memorized data only within 
the set with the same hashing value. 

Fully associative memories with the most advanced 
memory capacity offer 20 Kbits per component, which 
is more than 500 words with 32 bits width for each word. 22 
Solutions with the best integration density use five transistors
per 2-bit cell, that is, 2.5 transistors per bit. 23 If
used to compare tuples belonging to a couple of hash¬ 
ing buckets, the corresponding size of these hashing 
buckets is very satisfactory and is not even necessary for 
high efficiency. However, with this circuit type, data 
cannot remain in the circuit and must be transferred into 
it before the operation. 

On the other hand, if the circuit is to be used as a data 
cache, it needs a bigger memory. Usual relation sizes 
are between 100,000 and 1 million tuples, with current 
tuple sizes of 100 bytes or more. Database sizes range 
from one Gbyte to terabytes for seismic applications or 
satellite data. The solution may be a set-associative 
memory. 

With set-associative memories, we can separate logic 
and memory, which may be a classic RAM. We can then 
achieve large memory capacities, with approximately
twice the component count and price. However,
the memory capacities that are presently considered are 
not much larger than 10 Mbytes, which is much smaller 
than database sizes. The slow-down due to disk access 
limits the device speedup to about a factor of two (end-of-line).
17,24 Set-associative memories have also been
integrated on a single chip with memory capacities up 
to 160 Kbits per chip. 34 

The hashing function depends on the operation. It 
may therefore require reorganizing the data before an 
operation, which may bring us back to the situation where 
data are transferred from memory. Retrieval through set- 
associative operations in an information server is there¬ 
fore quite different from the usual use of set-associative 
caches, where retrieval is based on the logical addresses 
of data. 

Due to the limited size of associative caches and needs 
for rehashing, smaller associative memories with non¬ 
resident data may presently be a good solution. 


ment a complete associative machine based on an associa¬ 
tive cache. 

Functions. According to the applications, the most costly 


and frequent queries are based on joins, aggregates, sorting, 
and text retrieval. The three queries are implemented only 
by specialized associative memories. 19-21 Therefore, we wanted
to implement these four operations in a single associative 
processor. We chose to implement all query operations in 
the same hardware to simplify the software. 

We also felt that the length of the tuple must be arbitrary. 
In several existing circuits, the length of a record is limited— 
for example, 32 bytes in Ogura’s 22 and 256 bytes in the Ad¬ 
vanced Micro Devices 85C95. Larger tuples may be processed 
by using several components, but the record length is often a 
characteristic of the machine. In RAPID-1, tuples may have 
any length within the total memory capacity of the circuit, 
which may also use several components. The circuit is opti¬ 
mized for a useful length of the tuple (attributes taking part 
in the comparisons) between 4 and 100 bytes. As in Ogura’s 
circuit and in the Data Base Accelerator, 23 the reduction of 
the useful tuple length may increase the number of simulta¬ 
neously evaluated tuples. 

Memory capacity. Considering the present limitations of 
associative and set-associative cache solutions, we chose to 
design a fully associative circuit with a small amount of 
memory. The general processor transfers data during all op¬ 
erations. Performance depends on software efficiency and 
on the power of the host processor. In this “co-architecture,” 
performance increases with the power of the microproces¬ 
sor, up to the saturation of the associative circuit. This 
arrangement is therefore well adapted to the fast evolution of 
general processors. (See box on associative memory ca¬ 
pacity.) 

Performance. Classic associative circuits cycle in less than 
100 ns, but a comparison set implies several instructions. 
Comparisons using an equality criterion operate in a wide-bandwidth
manner (bits in parallel mode), but comparisons
on inequality criteria use serial calculations on each successive
bit of the comparand (serial-bit mode). The duration of a
one-word comparison is thus usually 32 cycles or more. This 
difference in performance between equality and inequality 
comparisons will imply a circuit inefficiency for inequality 
comparisons. 

We implemented bit-parallel comparisons for all opera¬ 
tions, which means the use of a comparator in each cell. This 
solution is costly in terms of circuit area, and reduces the 
number of tuple comparisons that can be done in parallel, 
but it guarantees good performance in all cases. We also 
used a reduced instruction set of high-level instructions, each 
executed in one cycle, to contribute to this purpose. 
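The cost difference between the two comparison modes can be illustrated with a small sketch. It is a software model under simplifying assumptions, not a description of any actual circuit: one loop iteration stands for one cycle, words are unsigned, and the function names are hypothetical.

```python
def bit_serial_greater(stored, comparand, width=32):
    """Return (stored > comparand, cycles) scanning one bit position per cycle."""
    decided = False
    greater = False
    cycles = 0
    for position in range(width - 1, -1, -1):   # most significant bit first
        cycles += 1
        s_bit = (stored >> position) & 1
        c_bit = (comparand >> position) & 1
        if not decided and s_bit != c_bit:
            decided = True                       # first differing bit settles the order
            greater = s_bit > c_bit
    return greater, cycles

def bit_parallel_greater(stored, comparand):
    """Return (stored > comparand, cycles) using a full-width comparator."""
    return stored > comparand, 1
```

The serial version always scans the full word width because, in this model, all cells step through the comparand bits together; the parallel version pays for a comparator in every cell but answers in a single step.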

Data structures 

The RAPID circuit 25,26 recognizes two types of data structures:
tuples and logical formulas. Data are memorized in the
circuit as logical formulas while keys are broadcast to the
processing elements as tuples. We shall explain here the
mapping between tuples and logical formulas.

Tuples are variable-length structures, with an arbitrary number
of attributes, or fields, each with an arbitrary length. The
only limitation to the attribute or tuple length is the total 
memory size of the circuit, which may be composed of an 
arbitrary number of components. The internal data representation
is the byte string. The comparison between numbers
in two's complement is mapped into a byte-string comparison
by changing the appropriate bit. This operation is presently
done by software. A similar approach is used for real
numbers.
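One common way to obtain this mapping is sketched below. It assumes that "the appropriate bit" is the sign bit and that bytes are compared most significant first; neither detail is stated in the text, and the function name and the 4-byte default width are hypothetical.

```python
def int_to_sortable_bytes(value, width=4):
    """Encode a signed integer so that byte-string order equals numeric order."""
    bits = 8 * width
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the given width")
    twos_complement = value & ((1 << bits) - 1)      # two's-complement bit pattern
    flipped = twos_complement ^ (1 << (bits - 1))    # invert the sign bit (assumption)
    return flipped.to_bytes(width, "big")            # most significant byte first
```

With the sign bit inverted, negative numbers get a leading 0 bit and nonnegative numbers a leading 1 bit, so a plain unsigned byte-string comparison orders them correctly.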

Each tuple is referenced by a 32-bit identifier, which may 
be a pointer, a link (a pointer on a pointer), or an identifier 
that does not change when the tuple is updated. The tuple is 
stored in the circuit as a logical formula, as we describe later. 
The tuple identifier then becomes the identifier of the corre¬ 
sponding subexpression. This identifier is then stored in a 
specific register of each cell participating in the subexpression 
evaluation. When the circuit compares a key tuple with a set 
of tuples stored as subexpressions, the result may be the 
sequence of identifiers of the subexpressions that are satis¬ 
fied after their instantiation by the key tuple. 

The circuit also recognizes logical formulas. These are com¬ 
posed of subexpressions that are, in turn, composed of predi¬ 
cates. Predicates are connected by the functions And or Or, 
according to the priority used in the logical formula, or by 
the special function Then (a variant of And), that we will 
explain later. The complementary function connects 
subexpressions. The circuit can either evaluate the global 
value of the logical formula or evaluate in parallel each 
subexpression. The result is either the set of identifiers of 
satisfied subexpressions or the quantity of these 
subexpressions. The global (Boolean) value of the formula is 
still available in any case. The evaluation of a fuzzy logical 
formula is possible, through the return of the weight of each 
satisfied subexpression. Such fuzzy, or vectorial, evaluation 
can rank documents or paragraphs by value. 27,28 The circuit 
can also take into account (or not) the order of retrieved 
words in a sentence or document. 

Predicates are either attribute-operator-constant, or attribute- 
operator-attribute. Each one is usually evaluated by a distinct 
processing element. When an argument is an attribute, it is 
represented by an attribute number stored in a specific loca¬ 
tion of the local memory of the processing element (PE). 
When a PE must compare an attribute with a constant, it 
waits for the arrival of the attribute number of the compari¬ 
son, which usually follows the end-of-attribute (FA) signal 
corresponding to the previous attribute or the new-tuple (C3) 
signal. If the information on the data bus is equal to the 
predicate attribute number, the PE begins an evaluation phase. 
If not, the PE remains idle until the next attribute number 
arrives. The proportion of PEs used in parallel varies accord¬ 
ing to the query; however, in most operations it is 100 percent. 


Comparison operators include <, < >, >, =, Included in, 
Contains, Strictly included, and Strictly contains. The predicate
“arg1 Included in arg2” corresponds to the search for a
nonanchored byte string within another byte string. Text 
attributes can be divided into words that are separated by 
punctuation characters defined by the user. Inclusion and 
comparison predicates may be completed by the prefix in¬ 
formation, which means that the retrieval is anchored at the 
beginning of an attribute or a word but may end before the 
end of the word. 

There is also suffix information, corresponding to the fact 
that the character string does not necessarily begin at the 
beginning of an attribute or word, but that it ends at an 
attribute or word end. Specific word information indicates 
whether word separations are taken into account for a given 
predicate. 
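The interplay of the prefix, suffix, and word information can be sketched as follows. This is an illustrative software reading of the description above, not the circuit's logic: the function name, the default separator set, and the treatment of "prefix and suffix together" as an exact match are assumptions.

```python
import re

def included_in(pattern, text, prefix=False, suffix=False, words=False,
                separators=" ,.;:!?"):
    """Return True if `pattern` occurs in `text` under the given anchoring."""
    if words:
        # Split the attribute into words on user-defined punctuation characters.
        units = [w for w in re.split("[" + re.escape(separators) + "]+", text) if w]
    else:
        units = [text]                      # the whole attribute is one unit
    for unit in units:
        if prefix and suffix:
            matched = unit == pattern       # anchored at both ends (assumption)
        elif prefix:
            matched = unit.startswith(pattern)
        elif suffix:
            matched = unit.endswith(pattern)
        else:
            matched = pattern in unit       # nonanchored search
        if matched:
            return True
    return False
```

For instance, with word separation enabled, the pattern "data" matches "large databases" as a prefix of the second word, while "base" matches only when the search is nonanchored.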

Subexpressions are defined by 

subexp = predicate | subexp connect predicate, with

connect = And | Or | Then.

Then is a variant of And. It corresponds to one of the follow¬ 
ing: 

• a succession of nonanchored retrievals in the same 
attribute; 

• the comparison of tuples according to a succession of 
sorting criteria; or 

• the successive comparison of fractions of an attribute 
and of a large (more than 32 bytes) constant. 

In some cases, comparisons are made on calculated 
results. As an example, we can retrieve the names of people 
who earn at least 1.1 times more than the average salary of 
their region. (See sidebar on associative memory capacity.) 
In this case, the software evaluates a virtual attribute, which 
receives a number (for example, attribute 4), such that, 
Attribute 4 = 1.1 x Attribute 3. This attribute is stored as a 
predicate constant and/or broadcast as a key in place of an 
ordinary attribute. 

Instruction set 

Figure 1 (on page 26) gives the instruction set of the RAPID- 
1 component. Instructions are stored in a codop (operation 
code) field and specified by an operator code, Oper, and by 
complementary bits. One instruction word is associated with 
each predicate, or more generally, to each PE. The circuit is 
controlled by data. No instruction is needed at evaluation 
time. 

A control word (10 bits) broadcasts with every data word 
(16 data bits, plus 6 special bits). 25 Therefore, the writing of 
data into the circuit makes full use of the usual 32-bit word 
width. 
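The arithmetic behind this remark is 10 + 16 + 6 = 32 bits. The sketch below packs the three fields into one 32-bit word to make that concrete. The field order (control bits highest, then special bits, then data bits) is purely an assumption for illustration, as are the function names; the text does not specify the layout.

```python
CONTROL_BITS, SPECIAL_BITS, DATA_BITS = 10, 6, 16   # widths given in the text

def pack_word(control, special, data):
    """Pack the three fields into one 32-bit integer (field order is assumed)."""
    for name, value, bits in (("control", control, CONTROL_BITS),
                              ("special", special, SPECIAL_BITS),
                              ("data", data, DATA_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (control << (SPECIAL_BITS + DATA_BITS)) | (special << DATA_BITS) | data

def unpack_word(word):
    """Split a 32-bit integer back into (control, special, data)."""
    data = word & ((1 << DATA_BITS) - 1)
    special = (word >> DATA_BITS) & ((1 << SPECIAL_BITS) - 1)
    control = word >> (SPECIAL_BITS + DATA_BITS)
    return control, special, data
```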
SELECT
JOIN
PROJECTION
SORTING
SEMI-JOIN
SET OPERATIONS
NOP
RESERVED


Figure 1. Instruction set. 

This instruction set should evolve in the next version by 
the suppression of the SEMI-JOIN and SET OPERATIONS 
instructions, which perform the same function as selection. 
Two important operations, EXTERNAL SORT and AGGRE¬ 
GATE with GROUPED BY, are executed by using the JOIN 
and PROJECTION operations. We briefly present the first four 
instructions. 

SELECT. This two-phase operation loads and evaluates
the selection expression. It returns a Boolean result for each
evaluated tuple. Circuit overflow is not presently permitted;
a SELECT query whose expression does not fit in the circuit
must be decomposed into several queries.

JOIN. This is also a two-phased operation that loads tuples 
from one hashing bucket as a sequence of logical 
subexpressions, and evaluates the other hashing bucket. 
Predicates load sequentially. After a predicate has been stored 
in a PE, the PE transmits control to the next PE. For each key 
(which belongs to the second hashing bucket), the operation 
returns the sequence of tuple identifiers of the first hashing 
bucket that satisfies the JOIN condition. At the end of each 
tuple evaluation, to inform the PEs that none of them should 
wait to write its identifier on the bus, a specific C3 signal 
ends the identifiers’ write phase. This signal can also inter¬ 
rupt the transmission of the identifier sequence. 

This last mechanism is used in the EXTERNAL (distribu¬ 
tive) SORT operation, which uses the JOIN instruction. The 
results of the external sort operation are the identifiers of the 
sorting limits. He et al. studied efficient methods for the choice 
of limits. 29 

PROJECTION. The projection operation is a single load-evaluation
phase. The new tuple from the current hashing
bucket loads into the first free PE, and is compared at the
same time with tuples loaded as a logical formula in other 
PEs. The representation of a set of tuples as a logical formula 
for a projection is given in Figure 2, where (a(i), ..., a(i+n)), 
(b(i), ..., b(i+n)), and so on, are the values of tuples that are 
not duplicates. If the value of the global logical formula is 
false, the current tuple is not a duplicate. In this case the 
software kernel stores the tuple identifier in the result rela¬ 
tion. The control for the loading of the next tuple transfers to 
the next free PE. If the value of the logical formula is true, the 


(Att i = a(i) and ... and Att i+n = a(i+n))
or
(Att i = b(i) and ... and Att i+n = b(i+n))
or ...

Figure 2. A set of tuples as a logical formula. 

key tuple is a duplicate. The identifier of the logical 
subexpression satisfied by the key tuple is returned by the 
circuit. This result is mainly used when using the PROJEC¬ 
TION operation for the calculation of an AGGREGATE with 
GROUPED BY. The end-of-formula token is not moved, and 
the PEs used to store the previous tuple are reused for the 
next one. This operation is fully managed by the sequencers 
of the relevant PEs, without using an instruction. 

These mechanisms can also be used to execute AGGRE¬ 
GATE with GROUPED BY operations, such as the average 
salary per region in the People (name, region, salary) relation.
(See box on JOIN example.) To calculate this aggregate,
we build an array in which each line corresponds to a class. 
Columns describe the name of the class (here the region 
name), the intermediate calculation of the class (here the 
class cardinality and the total amount of salaries), then the 
final result for the class. Classes are created when new tuples 
are evaluated and correspond to new classes. Therefore the 
array is not ordered by class names. A sort on this column or 
any other may of course be done at the end of the operation; 
it is generally a small one. If the class is defined by an inter¬ 
val, the class number is calculated during a virtual attribute 
calculation step, before the aggregate operation itself. This 
solution makes possible a fast aggregate execution (due to 
the efficiency of hashing) and an economical one (no specific
instruction). However, it increases, in a limited way, the
projection cost.
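The aggregate procedure described above (and given as pseudocode in Figure 3) can be sketched in Python as follows. The list `stored` stands in for the class predicates held in the PE array and the inner loop for the associative match, which the circuit would perform in parallel; the function and variable names are hypothetical.

```python
def average_by_class(bucket, class_attr, value_attr):
    """Average `value_attr` per distinct `class_attr`, in order of first appearance."""
    stored = []        # class values already loaded, playing the role of the PE array
    classes = []       # one row per class: [name, cardinality, total]
    for tup in bucket:
        key = tup[class_attr]
        # "Evaluate" the key against every stored predicate <ATTRIBUTE = class value>.
        identifier = 0
        for class_number, value in enumerate(stored, start=1):
            if value == key:
                identifier = class_number
                break
        if identifier == 0:                # no match: the tuple opens a new class
            stored.append(key)
            classes.append([key, 0, 0])
            identifier = len(stored)
        row = classes[identifier - 1]
        row[1] += 1                        # class cardinality
        row[2] += tup[value_attr]          # running total
    return [(name, total / count) for name, count, total in classes]
```

As in the text, classes appear in order of creation rather than sorted by name, so a final sort on the small result array remains optional.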

INTERNAL SORT. The arguments of INTERNAL SORT are 
a set of tuples (small enough to fit into the accelerator) and a 
specific qualification expression (a list of attributes of ascend¬ 
ing or descending sorts). The operation takes place in two 
phases: a load phase, in which the set of tuples is loaded into 
the circuit as a logical formula, and an evaluation phase, in 
which each tuple of the same set is successively used as a 
comparand. (See Figure 3.) For each comparand tuple, the 
associative circuit returns the tuple rank versus the other tuples 
of the set. The operation result is a rank array that is very 
comparable to that of the SORT operation in APL, 30 though it 
can include ex aequo, that is to say, tuples with the same 
rank. It is easy to use it to get a rank array without ex aequo, 
then physically reorder the tuples in sort order, but this is not 
JOIN example with the RAPID circuit


Let us assume we have relations People (name, region, 
salary) and Region (name, average salary). The second 
relation may result from an aggregate operation (with a 
grouped by clause) on the first one. We shall retrieve the 
people who earn more than the average salary of their 
region, by executing a join between those two relations 
with the qualification expression 

People.Region = Region.name and 
People.salary > Region.average salary 

Let us consider the following data: 


People

#   name      region   salary
1   Dupond    Paris    10
2   Durand    Paris    20
3   Anatole   Caen     20

Region

#   name    average salary
1   Paris   15
2   Caen    20

These two relations may correspond to a hashing bucket 
with the same number of larger relations, with a hashing 
function on People.Region and Region.name. 

Let us assume we store the tuples from relation Region 


in the circuit as logical formulas. The qualification expres¬ 
sion will map them into the following logical formula, where 
subexpressions #1 and #2 will be evaluated in parallel: 


PE    Predicate       Connector   Subexpression
PE1   ATT2 = Paris    And         #1
PE2   ATT3 > 15       Or          #1
PE3   ATT2 = Caen     And         #2
PE4   ATT3 > 20       Or          #2


Attribute numbers stored into the predicates are those of 
the corresponding attributes of the people relation. Virtual 
attribute # is the tuple identifier. Subexpression #1 corresponds
to tuple #1 from Region. Connector Or is mainly
used as a separator between subexpressions. 

In this example, tuple #1 from People, (Dupond, Paris, 
10) does not satisfy the logical formula. Tuple #2 (Durand, 
Paris, 20) satisfies the subexpression #1 and therefore sat¬ 
isfies the logical formula. The result, which is then deliv¬ 
ered by the circuit, is the following set of logical addresses: 
E = {#1, 0}, where identifier 0 means “end of result.” In the 
future, a single bit of the identifier should be used to signal 
the end of result while minimizing transfers. In the same 
way the result for tuple #3 is E= {#2, 0}. Data broadcast to 
the circuit are limited to the attributes that participate in 
the comparison (here, attributes 2 and 3). 
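The evaluation of the stored formula against a broadcast key can be mimicked in software as below. This is a sequential sketch of what the PEs do in parallel, assuming a formula in And priority where Or closes each subexpression; the tuple layout of `PES`, the `OPS` table, and the function name are hypothetical.

```python
import operator

OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

# One entry per PE: (attribute number, operator, constant, connector, subexpression id).
# Values follow the box: tuples of Region stored as predicates on People attributes.
PES = [
    (2, "=", "Paris", "And", 1),
    (3, ">", 15,      "Or",  1),
    (2, "=", "Caen",  "And", 2),
    (3, ">", 20,      "Or",  2),
]

def evaluate(key, pes=PES):
    """Return identifiers of satisfied subexpressions, ended by 0 ("end of result")."""
    satisfied = []
    value = True                       # running And of the current subexpression
    for attr, op, const, connector, ident in pes:
        value = value and OPS[op](key[attr], const)
        if connector == "Or":          # Or closes the subexpression
            if value:
                satisfied.append(ident)
            value = True
    return satisfied + [0]
```

Here a key is a mapping from attribute number to value, for example `{2: "Paris", 3: 20}` for the tuple (Durand, Paris, 20), which yields the identifier sequence `[1, 0]`.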


class_number = 1
while the bucket is not empty
do
    read the class_attribute of the current tuple
    store identifier = class_number
    store the predicate <ATTRIBUTE = class_attribute>
    if result identifier = 0
    then begin
        store tuple identifier in result relation
        do aggregate calculus on class = class_number
        /* e.g., increment class cardinality */
        class_number = class_number + 1
        move token in the PEs array
    end
    else begin
        send signal C3 (“send no more identifiers”)
        do aggregate calculus on class = identifier
    end
end


Figure 3. Aggregate algorithm. 


always needed. It may be sufficient to access some elements 
of each tuple in sort order. 

The tuple sets used for internal sorting are buckets result¬ 
ing from a distributive sort. From the rank array of each bucket, 
it is usually necessary to calculate the ranks array versus the 
whole relation. This is done by adding a constant to the 
ranks of each bucket. 

During the loading phase of each sort bucket, sort attributes 
are stored in the circuit as predicates (for example, as 
“attribute < constant” in the case of an ascending sort 
attribute). Successive sort predicates are connected by a Then 
connector. If successive sort attributes correspond to the same 
sorting sense, it is possible to compact them in the same PE, 
in the limits of the memory capacity of the PE. During the 
evaluation phase, the rank calculation is a count of the num¬ 
ber of satisfied subexpressions. In the case of an ascending 
sort, the number of satisfied expressions would be the num¬ 
ber of inferior tuples in sort order. He et al. simulated param¬ 
eters for an optimization of external and internal sort using 
the RAPID circuit. 29 
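The rank calculation can be sketched as a count, for each comparand, of the tuples that precede it in sort order. The code below is a sequential illustration of that counting, with the per-bucket constant passed as `offset`; the function name and the `key` parameter (which would return a tuple of sort attributes for a multi-criteria sort) are assumptions, and only ascending order is shown.

```python
def rank_array(bucket, key=lambda t: t, offset=0):
    """Rank of each tuple = offset + number of tuples strictly before it in sort order."""
    ranks = []
    for comparand in bucket:
        # Count the stored tuples that precede the comparand; the circuit's
        # counting block would obtain this number in one evaluation.
        ranks.append(offset + sum(1 for stored in bucket if key(stored) < key(comparand)))
    return ranks
```

Tuples with equal keys receive the same rank (ex aequo). When buckets come from a distributive sort, the offset of a bucket is the number of tuples in all preceding buckets, so `rank_array(second_bucket, offset=len(first_bucket))` gives ranks relative to the whole relation.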
Architecture

The RAPID accelerator (Figure 4) is 
an associative circuit that interfaces with 
the host machine as a memory. Figure 
5 shows it is composed of three main 
functional blocks: the array of associa¬ 
tive cells or PEs, the subexpressions 
resolution block, and the query resolu¬ 
tion block. In contrast with other asso¬ 
ciative circuits such as Ogura’s and the 
Data Base Accelerator, it does not have 
a bit-management block, because it does 
not execute bit serial operations. The 
subexpression resolution block is a spe¬ 
cific logical block that executes Ands, 
Ors, or Thens between predicates. Its 
operation is detailed later. Most asso¬ 
ciative circuits do not include this type 
of functional block, but Ogura et al. 22 
proposed a functionally similar struc¬ 
ture, and Penazola and Ozkarahan 31 
proposed a seemingly similar one. The 
query resolution module includes the 
And block, the Or block, the counting 
block (“+” block), and the arbitration 
block. 

The arbitration block is the classic 
multiple-match resolution circuit, which 
is a characteristic of associative memo¬ 
ries. (See Figure 6.) As this block is 
mainly implemented using a single, pro¬ 
grammable component, and the length 
of the circuit in gates is a logarithm of 
the number of components, its delays 
are not very dependent on the number 
of PEs or of components, which is usu¬ 
ally between eight and 16. 
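To illustrate what multiple-match resolution does, the sketch below selects responders one at a time by descending a binary tree over the match flags, so the selection depth grows with the logarithm of the number of inputs. It is a behavioral illustration only, not a model of the circuit in Figure 6; the function name and the lowest-index-first priority are assumptions.

```python
def resolve_matches(flags):
    """Yield indices of set flags one at a time, lowest index first."""
    pending = list(flags)
    while any(pending):
        lo, hi = 0, len(pending)
        # Descend a binary tree: go left whenever the left half holds a request.
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if any(pending[lo:mid]):
                hi = mid
            else:
                lo = mid
        yield lo
        pending[lo] = False            # the selected PE has written its identifier
```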

The And block calculates a logical 
And between ends of subexpression 
values, which enables an evaluation of 
logical formulas in Or priority. The Or 
block executes a logical Or between the 
same values, which is used for the evalu¬ 
ation of logical formulas in And prior¬ 
ity. As this representation normally 
represents tuples as a logical formula, 
the Or block can also check if one tuple 
at least satisfies the comparison condi¬ 
tion with the key tuple. The counting 
block counts the number of true 
subexpressions, which is used for the 
internal sort operation. 

Processing elements are all at the 



Figure 4. RAPID board. 


Figure 5. Block diagram of the accelerator.


same address for writes and reads. Writes and reads are done
according to a protocol. Data control the circuit. Management
of the communication protocol with the host machine bus is at
the board level, to make the circuit independent from a given
host machine. The board usually produces a clock cycle only if
the host processor has written or read data in the circuit at
the end of the previous cycle. However, the circuit is
synchronized with the host machine.

The associative circuit may be imple¬ 
mented into one or several main com¬ 
ponents, each including several PEs and 
part of the other functional blocks. With 
more than a few tens of PEs per 
machine, the performance of the accel¬ 
erator becomes independent from its 
number of PEs, except for very large 
relations (more than one million tuples). 

The subexpression resolution block is 
fully distributed on the main compo¬ 
nents, based on one resolution module 
per PE. Each resolution module cas¬ 
cades with both the previous and next 
ones. The part of the query resolution 
block common to several main compo¬ 
nents is presently constructed with off- 
the-shelf components. 

Processing element. The PE is the 
basic cell of RAPID. It executes the 
embedded operations and memorizes 
data. The external view of the PE in Figure 7 displays its 
interconnections with its environment. These connections 
include the internal data bus, command bus, control signals, 
connections with the previous and next PE, and pins signal¬ 
ing the relative position of the PE in the PE array. Table 1 on 
page 30 lists descriptions of the corresponding signals. 

An internal view of the PE (Figure 8 on page 31) reveals an
operative part and a control part. The operative part includes 
a local memory with 18 words for the storage of a 16-word 
operand and two attribute numbers. Each word is 16 bits 
wide, and an attribute number also has 16 bits. The operative 
part also includes a logical comparator, a predicate generator 
for PRED values, and several registers. The registers include 
two for the external and internal configuration of the PE (STA- 
TUS1, which holds the instruction word and is read/write, 
and STATUS2, which holds the PE state and is read-only), 
two for the interface with the PE environment (INOUT and 
MCR), and a fifth for the storage of the tuple identifier (IDENT). 

The control part is the PE sequencer, which is divided into 









Figure 6. Arbitration block, from Anderson 32 (a); close-up of node (b).

Figure 7. External connections of the PE.


three automata. The test automaton (AUTOT) enables a very 
fast access to the operand, configuration, and internal regis¬ 
ters during normal operation. The primary automaton 
(AUTOP) schedules the major part of the PE’s operations. 
The auxiliary automaton (AUTOA) helps the main automa¬ 
ton during the loading phase. 

Subexpressions resolution. The subexpressions resolution (Figure 9 on
page 31) implements a chained And, a chained Or, and a chained Then.
MON_in and MON_out are the subexpression values, CONX the connector
value, PRED the predicate value, and FINMON the result. We have detailed
the subexpression in earlier papers. 25,26 The response time of the
subexpressions resolution circuit equals the propagation time in the
logical groups corresponding to the maximum size of a subexpression. We
can determine the response time by considering the propagation time on
all modules, or by choosing a maximum size for a monomial or
subexpression.

Table 1. External PE signals.

Command signals
  C1: starting a loading phase at the next cycle
  C2: starting an evaluation phase at the next cycle
  C3: broadcast of a tuple to be evaluated at the next cycle
  FA: end of an attribute
  FR: end of the relation
  FTUP: end of a tuple
  FMOT: end of a word in a text
  ADS: presence of a PE address on the data bus
  CTRL0//T1: opening parenthesis in the normal mode//a bit of the test
    instruction code in the test mode
  CTRL1//T2: closing parenthesis in the normal mode//a bit of the test
    instruction code in the test mode

PE chain signals
  MON_in/MON_out: partial subexpression value propagation
  Suite/Cont: control transmitted from one PE to another to make it the
    current PE
  SGLI/ACTIGLI: activation of the PE that contains the continuation of a
    nonanchored search by one PE
  SUIGLI/PUIGLI: activation of the PE group containing the continuation
    of a nonanchored search by a group of PEs
  DVPE_in/DVPE_out: propagation of a tuple overflow signal in a loading
    phase

Other signals
  OUMON: input pin, receives the result of the Or resolution tree
  RE: PE initialization input
  HAI: PE entry into the PAUSE mode (especially in the test mode)
  Wl: output pin, bus capture by the PE
  Error: output pin, error detection by the PE
  PPE: fixed pin, used to distinguish the first PE of the PE vector
  DPE: fixed pin, used to distinguish the last PE of the PE vector

An important problem in the circuit design is the processing of large
predicates, in which the constant overflows the local memory of one PE.
Several solutions have been proposed for that problem, including an And
of comparisons on several operand parts identified by an index, 22 the
addressing of a group of cells corresponding to an address where
low-weight bits are masked, 23 or the transfer of a control token from
one byte to the next in ISSP. 20,21 We pass control from one PE to the
next when the comparison on the current operand part ends and the
comparison result is equality. The corresponding predicate value is then
forced to the neutral value for the subexpression resolution (true in
And priority). The only significant value of PRED is that of the first
PE that returns an inequality comparison or the value of PRED in the
last PE. The PRED values for the next PEs in the same predicate, if any
exist, remain at the neutral value. 25
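The control-passing rule for large predicates can be sketched in a few lines of Python. This is only an illustrative model, not the circuit's logic: the function name compare_across_pes, the use of one integer per PE to stand for that PE's operand part, and the most-significant-part-first ordering are all assumptions made for the example.

```python
from typing import List


def compare_across_pes(stored_parts: List[int], incoming_parts: List[int]) -> int:
    """Model a comparison chained over several PEs (hypothetical helper).

    Each list element stands for the operand part handled by one PE,
    most significant part first (an assumption of this sketch).
    Returns -1, 0, or 1 for incoming < stored, equal, or incoming > stored.
    """
    for stored, incoming in zip(stored_parts, incoming_parts):
        if incoming != stored:
            # The first PE that returns an inequality fixes the result;
            # the following PEs would stay at the neutral value.
            return -1 if incoming < stored else 1
        # Equality: this PE stays neutral and passes control to the next PE.
    # Every part compared equal, so the value of the last PE stands.
    return 0


print(compare_across_pes([1, 2, 3], [1, 3, 0]))
```

Only the first differing part influences the outcome, which is why the PEs after it can remain at the neutral value.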

RAPID versions. We designed version V0 to prove the circuit feasibility.
It implements all relational operations extended to sorting. It was
customized using Silvar-Lisco CAD software. A component contains a
single PE, in a 68-pin CLCC package. Fifteen of the pins are free for
further versions. Table 2 gives the characteristics of this version.

With a host machine powerful enough to saturate the accelerator, V0
executes a 1,000 x 1,000-tuple JOIN in 4 ms. However, the limitation of
the total number of PEs on an actual board implies a significant
overhead in the case of large relations.

Implementation of V0 effectively began in 1988, but stopped in 1989 due
to the departure of the logic designer. At this time, we had only a
partial simulation. Checks at logic and layout levels were resumed in
spring 1990 by a PhD student. The ES2 Company fabricated it in October
1990.

We designed most of V1 from scratch. V1.1 is compatible with the SQL-1
standard and implements four PEs on each component. A practical board
with eight main components may then include 32 PEs, enough for current
performance standards. We designed this version with standard cells and
a 1.3-micron technology, using the Cadence CAD package.

V1.1 includes text retrieval, detection and processing of null values
according to the SQL-1 standard, and the processing of overflows.
Modifications of the PROJECTION operation enable it to use the AGGREGATE
with GROUPED BY function, according to the method described earlier. We
also improved testability. We added scan paths, which make it possible
to separately scan four parts of the circuit. Using the test automaton,
we can rapidly access the internal states of the circuit during normal
operation and therefore run fast test programs. The effective version
design started in summer 1990 with one PhD student, and we expect the
circuit to be fabricated by the end of 1991.

Environment and performance

With the RAPID accelerator, access to data, hashing, data transfer to
the accelerator, and transfer of results into result relations are done
by a software kernel executed by the host processor. This kernel is
composed of a multiuser transaction manager that manages threads
(lightweight processes), a query machine that executes query operations,
and an object manager that handles object creation and access. The
thread management by the software kernel guarantees a low-cost, atomic
execution of operations by the accelerator, that is to say, an execution
in which a suboperation is never preempted. This atomicity is needed for
performance reasons. The accelerator executes the operations. However,
in the case of selections, preselections on the most selective attribute
are done using an index. Software kernel performance is a major factor
of RAPID's efficiency.

We developed the first version of the software kernel on a Sun 3/50,
with a recent object manager. We measured the duration of the
1,000 x 1,000-JOIN (a JOIN of 1,000 tuples by 1,000 tuples returning
1,000 result tuples) in the software configuration that corresponds to
the use of the accelerator. We have not used the accelerator board in
the Sun workstation. (One version of the board has been implemented in a
Macintosh II.) The software operation time is 1.23 seconds. The product
of this time and the number of MIPS of the machine (about 2.5) is better
than the throughput of the Gamma machine, which is a well-known parallel
machine that uses about 7 MIPS to execute this operation in memory. 33
The time lapse in the accelerator is very small, about 4 ms with the
first version of the circuit.

We are improving the performance with a novel, performance-oriented
object manager. We have completed the design of this object manager and
estimate it will reduce the access time by approximately one order of
magnitude in an average case. This object manager is a critical module.
By using it and some complementary software improvements, the software
duration of the reference operation should drop to about 200 ms, while
the time spent in the hardware should be about 2.38 ms with the V1
version.

Figure 10 on page 32 shows the improvement ratio at a given number of
MIPS, based on our assumptions. We assume the host machine throughput in
Mbytes per second equals the number of MIPS. But the throughput needed
for the circuit I/Os is much smaller, about 10^5 bytes/s x MIPS.
Clearly, the accelerator can significantly improve performance even on a
200-MIPS host machine.

Figure 8. Internal view of the processing element.

Figure 9. Subexpressions resolution: And priority (a); Or priority (b).
Output labels shown: FINMON*, FINMON+.

Table 2. Features of RAPID V0.

Configuration              Number of PEs x 16 words x 22 bits
Instruction set            7 instructions
Cycle time                 120 ns
Supply voltage             5 V
VLSI process technology    2-µm CMOS with double aluminium layers
Number of devices          11,000
Chip size                  5.3 mm x 5.2 mm
Number of pins             50
Package                    68-pin CLCC

Figure 10. Improvement ratio vs. host processing power (a); accelerator
utilization ratio vs. host processing power (b).
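As a rough check of the orders of magnitude quoted in this section, the arithmetic can be laid out as below. The variable names are illustrative only, and the numbers are simply those given in the text for the 1,000 x 1,000-tuple JOIN; nothing here reproduces the assumptions behind Figure 10.

```python
# Figures quoted in the text for the reference JOIN (names are illustrative).
software_time_s = 1.23       # software kernel time measured on the Sun 3/50
host_mips = 2.5              # approximate processing power of that host
accel_time_v0_s = 0.004      # time spent in the V0 accelerator

# Product of time and MIPS: the host work spent on the operation.
host_work = software_time_s * host_mips

# Fraction of the total elapsed time spent inside the accelerator.
accel_share = accel_time_v0_s / (software_time_s + accel_time_v0_s)

# Projected figures with the new object manager and the V1 circuit.
projected_software_s = 0.200
projected_accel_s = 0.00238
software_speedup = software_time_s / projected_software_s

print(f"host work: {host_work:.3f} MIPS x s")
print(f"accelerator share of elapsed time: {accel_share:.2%}")
print(f"projected software speedup: {software_speedup:.2f}x")
```

The accelerator accounts for well under one percent of the elapsed time in this configuration, which is consistent with the remark that software kernel performance dominates RAPID's efficiency.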


The RAPID-1 is an associative accelerator for database operations. Its
instruction set includes relational operations extended to SORT and
AGGREGATE with GROUPED BY. Data do not remain in the associative
circuit, except for very short durations. Programming is a mere
translation of query predicates, though an adequate software kernel is
needed for efficiency.

The circuit uses a small number of associative cells with 
hardwired control. It speeds up database systems by a factor 
of five, versus most referenced database systems. The accel¬ 
eration ratio will still be very significant with the emerging 
class of new microprocessors, which may have a peak power 
about or above 100 MIPS. The accelerator will also speed up 
parallel database systems using these processors. 

These results show that the choice of an associative circuit 
with hardwired control, that is controlled by data, with bit- 
parallel comparisons for all operations, is critical for the use 
of the accelerator with new generations of processors. Other 
solutions are possible but would lead to machines with weaker 
performance (due to multiple-instruction transfers) and less 
homogeneity when inequality predicates are used (such as 
for sorting). 

We hope to see this accelerator evolve further to include the processing
of binary strings, fault tolerance, a feasible performance increase by a
factor of 10 to meet the needs of the next generation of
microprocessors, and the design of components with more parallelism.

A Fine-Grain Architecture for Relational
Database Aggregation Operations


This design and simulation of a bit-sliced processor for relational
database aggregation functions addresses an important, computationally
expensive problem in database computers. The slice processor takes two
tuples as inputs (one bit at a time) and returns two bits as outputs
every clock cycle. A larger aggregation unit uses a number of identical
slice processors, connected according to odd-even network topology, to
achieve improved performance on a parallel pipelined processor.

Since its introduction by Codd in 1970, 1 the relational database model
has achieved wide acceptance. While allowing a high degree of data
independence, the model provides a theoretical means to deal with
consistency and redundancy problems. This is possible mainly because
data is represented in two-dimensional tables that can be designed using
the normalization theory.

In the relational database model, data can be 
manipulated through a small but powerful set of 
operators. These operators are the relational 
algebra operations (such as PROJECTION, JOIN, 
and SET) and aggregation operations. Most of 
these operations are computationally intensive 
and algorithm design for efficient implementa¬ 
tion plays an important role in the design of 
database computers. Efficient processing has a 
determining effect on system performance. As a 
result, the operations have been the subject of 
intense study in the development of relational 
database management systems. 

For instance, to maximize the performance of 
the natural JOIN operation, several algorithms 
have been developed. 2 While these methods 
tended to be effective in conventional von 


Neumann computer systems, a closer look 
reveals great opportunity for concurrent process¬ 
ing. With appropriate hardware and system or¬ 
ganization, such concepts as parallelism and 
pipelining can significantly enhance the perfor¬ 
mance of these operations. 

We designed and simulated a special-purpose 
unit for such aggregation functions as Sum, Count, 
and Average using a one-bit slice processor as a 
basic component. (We do not consider the re¬ 
maining two statistical aggregation functions Min 
and Max here as they are not computationally 
intensive.) A larger unit can be easily designed 
by connecting several identical slice processors 
according to odd-even network topology. 3 In this 
unit the tuples are input, processed, and output 
in parallel, one tuple at a time. The processing 
elements operate on tuples bit by bit. As a result, 
we refer to this system as parallel bit-level pipe¬ 
lined architecture. 4 

The main advantage of this approach is that the memory requirement of
each slice processor is very small and is independent of input size.
Since input operands are processed one bit at a time, we reduce the
amount of hardware in each slice processor. As a result, a large number
of slice processors can be integrated on one VLSI chip. The proposed
design exhibits several attractive features.

• A simple, basic slice processor. The system makes 
use of only one type of simple slice processor. Each 
slice operates on data bit by bit, simplifying design and 
verification of the circuit. 

• Overlap of data I/O and processing. Data processing 
time is completely overlapped with data input and out¬ 
put to and from each slice. 

• Static interconnection network. For ease of imple¬ 
mentation, the design uses a static interconnection 
between the different slices rather than a dynamic net¬ 
work. This connection allows the system to process sev¬ 
eral data streams in a pipelined manner. 

• High throughput. Several bit-serial slices operating in 
parallel achieve high throughput. 

• Low pin count. The input and output of operands one
bit at a time, to and from the slices, ensure a low pin count.

• Design is independent of tuple size. 

• Limited interconnection requirement. Bit-serial com¬ 
putation limits interconnection needs. 

Output of the tuples is completely overlapped with their input and
processing times. Therefore, a tuple stream input to the unit is
processed in a pipelined way, one bit per tuple at a time. Several
different input streams of tuples can also be processed in a pipelined
manner. That is, the processing of a new input stream of tuples can be
initiated while the previous stream is still being processed.
Consequently, the proposed unit achieves pipelining at both the tuple
and input stream levels.

We use a SIMD (single-instruction stream, multiple-data stream)
architecture to perform statistical aggregation functions. In contrast
to the parallel pipelined query processor approach in Kim et al., 5
sorting, aggregation, and duplicate marking are performed concurrently.
One network topology and one type of processing element (slice)
implement these functions. The design in Kim et al. uses three different
network topologies and processes parallel tuples one bit at a time.

The processor 

Let’s examine the design of our Sum processing unit. The 
same unit will be used to perform the COUNT and AVER¬ 
AGE operations. 


Figure 1. A slice processor. Inputs: reduced tuples Ti = (Ca^i, Mb^i,
Cg^i) and Tj = (Ca^j, Mb^j, Cg^j); outputs: OL1 and OL2.


Procedure SUM ((Ca^Mb.Cg 1 ), (Ca i ,Mb l ,Cg j )) 

/* Two reduced tuples T, = (Ca'.Mb'.Cg 1 ) and Tj= (Ca j ,Mb i l Cg i ) 
are input to a slice V 
begin 

if Cg' > Cg j then 

begin OL, := (C^.Mb'.Cg 1 ); 01-2:= (Ca',Mb',Cg’); 
elseif Cg'< Cg' then 

begin OL, := (Ca'.Mb.Cg'), 0L 2 := (Ca'.Mb.Cg'); 
else 

if Mb' AND Mb' = 1 then 

OL, := (Ca' + Ca',Mb',Cgi); 0L 2 := (-.O.Cg 1 ); 
elseif Mb' AND Mb j =1 then 

OL, := (Ca'.Mb'.Cg*) ; 0L 2 := (-,Mb,Cg*); 
else 

OL, := (Ca'.Mb.Cgty ; 0L 2 := (Ca',Mb',Cg'); 
endif; 

endif; 
end; 


Figure 2. Procedure Sum algorithm. 


Figure 3. Example of a Sum algorithm. With tuples written as (Ca, Mb,
Cg), inputs T1 = (2, 1, 3) and T2 = (2, 1, 3) produce OL1 = (4, 1, 3)
and OL2 = (-, 0, 3).


Methodology. To each reduced tuple, we associate a mark bit we call Mb.
Each one-bit slice processor takes as inputs two reduced tuples (see
Figure 1). The slice processor compares the two Group-By values (Cg^i,
Cg^j). It manipulates the aggregation values (Ca^i, Ca^j) and mark bits
(Mb^i, Mb^j) according to the result of the comparison phase. The
resulting reduced tuples are output through OL1 and OL2. The mark bits
identify the qualified tuples, which are output with a mark bit set
equal to 1. Thus, the processor filters out duplicate tuples by
examining their mark bit. Figure 2 shows the algorithm performed by the
slice processor.

An example appears in Figure 3 in which two tuples (T1 and T2) are input
to the slice processor. These tuples are both qualified (their mark bits
are both 1) and have the same Group-By value. Therefore, their
aggregation columns will be summed, and one of the tuples will be
disqualified by resetting its mark bit to 0.
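The behavior just described can be modeled at the tuple level with a short Python sketch. It is a simplified illustration, not the slice hardware. Two points are assumptions of the sketch rather than statements from the text: that the tuple with the larger Group-By value leaves on OL1 when the values differ, and that the qualified tuple leaves on OL1 when the values match but only one mark bit is set. The name sum_slice is hypothetical, and None stands for the don't-care value written "-" in the figures.

```python
from typing import Optional, Tuple

# A reduced tuple is written (Ca, Mb, Cg), as in the figures.
ReducedTuple = Tuple[Optional[int], int, int]


def sum_slice(ti: ReducedTuple, tj: ReducedTuple) -> Tuple[ReducedTuple, ReducedTuple]:
    """Tuple-level model of one Sum slice; returns (OL1, OL2)."""
    ca_i, mb_i, cg_i = ti
    ca_j, mb_j, cg_j = tj

    if cg_i != cg_j:
        # ASSUMPTION: the larger Group-By value is routed to OL1.
        return (ti, tj) if cg_i > cg_j else (tj, ti)

    if mb_i == 1 and mb_j == 1:
        # Same group, both qualified: sum the aggregation values and
        # disqualify the second output by resetting its mark bit.
        return (ca_i + ca_j, 1, cg_i), (None, 0, cg_i)

    if mb_i == 0 and mb_j == 1:
        # ASSUMPTION: the only qualified tuple is routed to OL1.
        return tj, ti

    return ti, tj


# The example of Figure 3: two qualified tuples with the same Group-By value.
print(sum_slice((2, 1, 3), (2, 1, 3)))
```

Running the two worked examples from the figures through this model gives the outputs shown there: (4, 1, 3) with (-, 0, 3), and (3, 1, 10) with (-, 0, 10).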

Hardware implementation. Figure 4 depicts a block diagram of a one-bit
slice processor that contains five flags: F1, F2, F3, Fi, and Fj. The
two reduced tuples are input, processed, and output one bit at a time.
Processing starts with the Group-By values Cg^m = {Cgk^m, ..., Cg2^m,
Cg1^m}, m = i, j, which are input to the slice processor bit by bit
starting from the most significant bits, or MSBs (Cgk^i, Cgk^j). (Here,
k represents the number of bits in the binary representation of the
Group-By values.) This makes the bit-serial comparison of the two
Group-By values easier to perform.

Figure 4. Block diagram of a slice processor.

Next, mark bits Mb^i and Mb^j are input, followed by the aggregate
values Ca^m = {Ca1^m, Ca2^m, ..., Cap^m}, m = i, j. Aggregate values
must be added serially, bit by bit, starting from the least significant
bits, or LSBs. Therefore, unlike the Group-By values, the aggregate
values are input starting from the LSBs (Ca1^i, Ca1^j). The number of
bits in the binary representation of the aggregate values is p. Notice
that the Group-By values comparison, the manipulation of mark bits, and
the summation of aggregate values completely overlap with the input and
output of the reduced tuples to and from the slice processor.
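The reason for the two bit orders can be seen in a small Python sketch: a comparison is settled by the first differing bit when scanning from the MSB, while an addition must propagate its carry upward from the LSB. The helper names below are hypothetical, and the carry uses the standard full-adder expression, which is an assumption of this sketch rather than a transcription of the slice circuit.

```python
from typing import List, Tuple


def to_bits_msb_first(value: int, width: int) -> List[int]:
    """Binary representation of value, most significant bit first."""
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]


def serial_compare(a_bits: List[int], b_bits: List[int]) -> int:
    """MSB-first bit-serial comparison: returns -1, 0, or 1."""
    result = 0  # stays 0 until the first differing bit is seen
    for a, b in zip(a_bits, b_bits):
        if result == 0 and a != b:
            result = 1 if a > b else -1
    return result


def serial_add(a_bits_lsb: List[int], b_bits_lsb: List[int]) -> Tuple[List[int], int]:
    """LSB-first bit-serial addition: returns (sum bits LSB first, final carry)."""
    carry = 0
    out = []
    for a, b in zip(a_bits_lsb, b_bits_lsb):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))  # standard full-adder carry
    return out, carry


# Group-By values 12 and 10 on k = 4 bits, aggregate values 2 and 1 on p = 2 bits.
print(serial_compare(to_bits_msb_first(12, 4), to_bits_msb_first(10, 4)))
print(serial_add([0, 1], [1, 0]))
```

Each function consumes one bit per operand per step and keeps only a constant amount of state, which mirrors the small, size-independent memory requirement claimed for the slice.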

Control and synchronization are achieved through two control signals, S1
and S2. S1, a start signal, is applied to the slice processor at time
instant t_0 for one clock cycle. During this time instant, the slice
processor resets its flags. At the next time instant, the slice
processor begins the processing of the reduced tuples present at its two
inputs.

Control signal S2 indicates the completion of the Group-By values
comparison. This signal is applied to the slice processor at time
instant t_k for one clock cycle. At the same time instant, the two mark
bits are input to the slice processor. From time instant t_(k+1) to
t_(k+p), the aggregate values are input to the slice processor.

The synchronous serial adder computes the sum (Sum = Ca^i + Ca^j). CF
denotes the adder carry flag (not shown in the block diagram). Flags Fi
and Fj store mark bits Mb^i and Mb^j. Control signal S2 sets flag F3,
which indicates that the Group-By values comparison has been completed.
Finally, flags F1 and F2 store the outcome of the comparison according
to Table 1.

The Bit Manipulation Circuit (BMC) determines the two outputs OLm =
(Cgk^m, ..., Cg2^m, Cg1^m, Mb^m, Ca1^m, Ca2^m, ..., Cap^m), m = 1, 2.
The Bit-Serial Sum algorithm (see the box) computes each output bit. The
slice processor starts the output process one clock cycle after the
input process begins. Thus, the output process completely overlaps both
the input and processing times. Note that the hardware algorithm is
independent of the tuple size. Because tuple size varies from one
application to another, this characteristic is crucial. As shown in
Figure 5, by controlling the time interval between control signals S1
and S2, the slice processor can be tuned to a particular tuple size.

We designed a slice processor using 10 D flip-flops and about 72 gates.
Figures 6 and 7 show an example (with p = 2 and k = 4) and its
simulation. In this example, the simulation starts at the trailing edge
of the first appearance of S1. The output process starts one cycle after
the input process begins.

Implementing other aggregation operations. The 
Count and Average functions can be implemented using the 
Sum unit. For the COUNT operation, each reduced tuple will 
be input to the Sum unit with a Ca value set equal to 1. Query 


Table 1. Flags F1 and F2.

Figure 5. Timing diagram.

Figure 6. Example of a one-slice processor operation. Inputs Ti = (2, 1,
10) and Tj = (1, 1, 10) produce OL1 = (3, 1, 10) and OL2 = (-, 0, 10).
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The Bit-Serial Sum Algorithm 

Procedure Bit-Serial SUM (Ti, Tj ) 

/ * S2 is now high for one clock cycle. During this 

/ * Two tuples T, = (Cgk( ..Cg2' Cgl' Mb' Cal' Ca2', ..., 

Phase (called mark bit manipulation phase) the two 

Cap') and T y = (Cgk/ ..., Cg2 7 Cgl 7 Mb' Cal 7 Ca2 7 , ..., Cap 7 ) 

mark bits are stored (in Fi and Fj) and a mark bit ma- 

are processed one bit at a time by the slice processor. The 

nipulation is performed. The phase duration is one 

outputting of the result is completely overlapped with the 

clock cycle*/ 

inputting process. The slice processor two outputs are 01^= 

F, Mb' ! Fj : = Mb'; CF := 0 ; 

(Cgk” ..Cg2'" Cgl” Mb'"Cal” Ca2”,..Cap”), m =1.2 */ 

if (F, = 0) then M b ] := (M b j or M b ') ; M b 2 := 0 ; 

if Sj = true then begin / ’Initialize flags (the duration of S, 

else M b ' := M,, 1 ; M„ 2 := M b '; 

is one clock cycle)*/ 

endif; 

FI 0 ; F2 := 0 ; F3 := 0 ; Fi := 0 ; Fj := 0 ; n :=k ; 

m := 1 

while S 2 = false and n > 1 loop / * Start the GROUP BY 

/ * Now the summation phase begins. This phase 

comparison phase */ 

will last until a new start signal (S,) is applied to the 

if (Cy > Cy) then 

processor*/ 

if (FI = 0) then C g „’ Cy ; C s „-’ C^; FI := 1; 

while S, = false and m < p loop 

else if ( F, = 0) then C g „> = C gn ‘ ; C gn 2 := CJ ; 

if (Fl = 0) and (Fi = 1) and Fj = 1) then /* refers 

plop r 1 .= r i • C 2 •= C * • 

CISC v-. gn . v^ gn , v_,gn • ^gn > 

to don’t care*/ 

endif ; 

Cam 1 := Cam 1 XOR Cam' XOR CF ; Cam 2 := -; CF 

elseif (Cgn 1 < C gn ') then 

:= Cam' and Cam' ; 

if (F, = 0) then cy CJ ; := cy ; FI := 1; 

elseif ( (Fl = 0) and (Fi = 0) and (Fj = 1 )) or (F2 

F2 := 1 ; 

=1) then 

elseif ( F2 := 0) then C gn ] := C gn j ; C gn 2 := C gn ' ; 

Cam 1 := Cam 1 ; Cam 2 := Cam 1 ; 

pIsp r 1 ■= C i- C 2 •= C 1 ■ 

else Cam 1 := Cam' ; Cam 2 := Cam' ; endif ; 

endif ; 

m := m + 1 ; 

picp r i .= r >. r 2 •= C > • 

CISC V_<g n . V-. gn , V-.gfj . Vj gn , 

end; 

endif; 

endif; 

n := n -1 ; 

end Bit-Serial SUM ; 

end; 



3 in the Aggregation Functions box illus¬ 
trates the AVERAGE operation, which is a 
combination of both Sum and Count. The 
AVERAGE operation proceeds as follows. 
To each reduced tuple (Ca,Mb,Cg), we 
attach a new attribute value Cc. This at¬ 
tribute value is set initially to 1. The aug¬ 
mented tuple (Cc,Ca,Mb,Cg) is input one 
bit at a time to the Sum unit. The Cc value 
is input immediately following input of 
the Ca values. The first (k +1) bits output 
from the Sum slice processor represent a 
Group-By value followed by a mark bit. 
The next p bits output from the unit rep¬ 
resent a Ca value. Finally, the remaining 
output bits represent a Cc value. 

Designing a Sum processor 

A one-bit slice processor can process 
two tuples one bit per tuple at a time. 


Figure 7. Simulation of the example in Figure 6. Traces shown: Clock,
S1, S2, F1, F2, F3, a mark-bit flag, Sum, Carry, OL1, and OL2.

Aggregation functions

In database management applications, we often want to categorize the
tuples of a relation by the values of a set of attributes and extract an
aggregated characteristic of each category. We call these database
management tasks aggregation functions. For instance, the SQL data
language includes the following built-in aggregation functions: Sum,
Count, Average, Min, Max. The attributes used for the categorization are
referred to as Group-By columns.

Consider the relation professor, defined as professor (faculty,
department, salary). Each tuple of this relation gives the name of a
faculty person, the department, and the academic year salary. Table A
lists an instance of the relation professor.

Table A. Instance of the relation professor (faculty, department,
salary).

faculty     department         salary
Smith       Electrical Eng.    $39,000
Joe         Mechanical Eng.    $35,000
Susan       Computer Sc.       $36,000
Erick       Electrical Eng.    $38,000
Paul        Electrical Eng.    $37,000
Johannes    Computer Sc.       $65,000
Rick        Computer Sc.       $32,000
Gerard      Computer Sc.       $43,000
Kenneth     Mechanical Eng.    $40,000

In this relation a row such as <Smith, Electrical Eng., $39,000> is
referred to as a tuple. There are no duplicate tuples. Consider the
following queries:

Query 1: How many faculty are in each department? The result is to be
ordered by department.

This query requests a count of the number of faculty for each
department. Faculty are therefore categorized according to the attribute
department. As a result, department is referred to as a Group-By
attribute. In SQL, the above query is formulated as follows:

Q1: Select department, Count (faculty)
    From professor
    Group By department
    Order By department

The result of applying the Count aggregation function is a new relation
with two attribute names. They are a Group-By attribute (department, in
this case) and a new attribute called Count. The tuples are ordered
lexicographically in ascending order according to the Order-By attribute
(department). The resulting relation, called Man-power, is

department         Count (faculty)
Computer Sc.       4
Electrical Eng.    3
Mechanical Eng.    2

Query 2: What is the total salary for all faculty in each department?
The result is to be ordered by department.

The SQL expression for this query is

Q2: Select department, Sum (salary)
    From professor
    Group By department
    Order By department

The result of the Sum aggregation function is a new relation called
payroll.

department         Sum (salary)
Computer Sc.       $176,000
Electrical Eng.    $114,000
Mechanical Eng.    $75,000

Query 3: What is the average salary in each department? The result is to
be ordered by department.

The SQL formulation of query 3 is

Q3: Select department, Average (salary)
    From professor
    Group By department
    Order By department

Applying the Average operator to the relation professor yields the
relation median-salary.

department         Average (salary)
Computer Sc.       $44,000
Electrical Eng.    $38,000
Mechanical Eng.    $37,500

Note that the AVERAGE aggregation operation combines both the SUM and
COUNT operations. Another way of representing the relation median-salary
is as follows:

department         Sum (salary)    Count (faculty)
Computer Sc.       $176,000        4
Electrical Eng.    $114,000        3
Mechanical Eng.    $75,000         2

The Average (salary) for each department can be easily obtained by
performing the following operation:

Average (salary) = Sum (salary) / Count (faculty)

Notice that the COUNT operation can be performed using 
the Sum operation by adding an extra attribute {extra) to the 
relation professor. For each tuple of the new relation denoted 
professor *, the extra attribute value is set to 1. An instance of 
the new relation appears in Table B. 

Query 4. The following Sum SQL statement is equivalent 
to Count statement Ql: 

Q4: Select department , Sum ( extra) 

From professor* 

Group By department 
Order By department 

These remarks are important from a hardware implemen¬ 
tation viewpoint. They imply that a Sum hardware unit, if 
designed efficiently, can also be used to process the Count 
and Average aggregation functions. 
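The point that one Sum operation is enough for all three functions can be checked on the professor* instance with a few lines of Python. The helper group_sum and the tuple layout are illustrative choices for this sketch; only the data values come from Tables A and B.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Relation professor* (faculty, department, salary, extra), from Table B.
PROFESSOR_STAR: List[Tuple[str, str, int, int]] = [
    ("Smith", "Electrical Eng.", 39000, 1),
    ("Joe", "Mechanical Eng.", 35000, 1),
    ("Susan", "Computer Sc.", 36000, 1),
    ("Erick", "Electrical Eng.", 38000, 1),
    ("Paul", "Electrical Eng.", 37000, 1),
    ("Johannes", "Computer Sc.", 65000, 1),
    ("Rick", "Computer Sc.", 32000, 1),
    ("Gerard", "Computer Sc.", 43000, 1),
    ("Kenneth", "Mechanical Eng.", 40000, 1),
]


def group_sum(rows, group_col: int, value_col: int) -> Dict[str, int]:
    """Sum value_col per distinct group_col value, ordered by group value."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row[group_col]] += row[value_col]
    return dict(sorted(totals.items()))


salary_sum = group_sum(PROFESSOR_STAR, 1, 2)   # Q2: Sum (salary)
head_count = group_sum(PROFESSOR_STAR, 1, 3)   # Q4: Sum (extra), same as Count
average = {dept: salary_sum[dept] / head_count[dept] for dept in salary_sum}

for dept in salary_sum:
    print(dept, salary_sum[dept], head_count[dept], average[dept])
```

The same summing routine produces the payroll, Man-power, and median-salary relations, differing only in which column it is pointed at.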

From these remarks, we can see that the processing of aggregation
functions involves in general two columns: Group By (Cg) and aggregation
(Ca). For instance, in Q2, the Group-By column is department, and the
aggregation column is salary. A tuple consisting only of these two
attributes will be referred to as a reduced tuple. During the process of
performing Sum and Average, a new column called the sum column (Cs) is
generated. Each value in Cs gives the sum of all Ca values corresponding
to a distinct Cg value. Also, in the process of performing Count and
Average, a new column referred to as the count column (Cc) is generated.
Each Cc value gives the number of occurrences of a distinct value in the
Group-By column.


Table B. Instance of the relation professor* (faculty, department,
salary, extra).

faculty     department         salary     extra
Smith       Electrical Eng.    $39,000    1
Joe         Mechanical Eng.    $35,000    1
Susan       Computer Sc.       $36,000    1
Erick       Electrical Eng.    $38,000    1
Paul        Electrical Eng.    $37,000    1
Johannes    Computer Sc.       $65,000    1
Rick        Computer Sc.       $32,000    1
Gerard      Computer Sc.       $43,000    1
Kenneth     Mechanical Eng.    $40,000    1


Processing a larger number of tuples in one pass requires 
several slice processors connected according to Batcher’s odd- 
even network topology. 3 

The odd-even network topology has been used to design several
special-purpose units. Batcher used it to design a sorting network, and
Sood et al. 6 used this structure to implement relational algebra
operations. Abdelguerfi et al. 4 use the same unit for signal
processing. Here, we show how a larger Sum unit can be obtained by
connecting a number of slice processors using the odd-even topology.

In this type of network topology, an n-input Sum unit is composed of

    \frac{n(\log^2 n - \log n + 4)}{4} - 1    (1)

identical slice processors connected according to the odd-even network
topology. (Throughout this article we refer to log_2 as log.) The
longest path in an n-input Sum processing unit is composed of
[log n (log n + 1)/2] levels. You will recall that the basic component
of the Sum unit is a pipelined one-bit slice processor. Slices of each
level take as inputs two bits (one per reduced tuple) from the preceding
level, process the two bits in one clock cycle, and output two new bits
to slices of the next level. We designed a four-input unit using the
slice processor of Figure 4 as a basic component. Since some paths are
shorter than others, a synchronization, which can be achieved through
the use of delay elements, is necessary. We simulated the example in
Figure 8 using this unit. The result of the simulation appears in Figure
9. Notice that the tuples are summed, marked, and sorted concurrently.
In this example, the output of the processed tuples starts three clock
cycles after the input process initiation. In general, for an n-input
Sum unit, the output of the processed tuples starts
[log n (log n + 1)/2] clock cycles after initiation of the input
process.

Figure 9. Simulation of the example in Figure 8.

Figure 10. Processing approach of Kim et al. 5
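To get a feel for how the unit grows with the number of inputs, the two size measures can be tabulated in Python. The function names are hypothetical. The slice count uses the comparator count of Batcher's odd-even merge sorting network, on the assumption that the Sum unit places one slice processor per comparator, and n is assumed to be a power of two.

```python
def log2_exact(n: int) -> int:
    """log2 of n, for n a power of two (assumed by this sketch)."""
    if n < 2 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two, at least 2")
    return n.bit_length() - 1


def slice_count(n: int) -> int:
    """Comparators in Batcher's odd-even merge sort network on n inputs:
    n(log^2 n - log n + 4)/4 - 1."""
    p = log2_exact(n)
    return n * (p * p - p + 4) // 4 - 1


def level_count(n: int) -> int:
    """Levels on the longest path: log n (log n + 1) / 2."""
    p = log2_exact(n)
    return p * (p + 1) // 2


for n in (2, 4, 8, 16):
    print(n, slice_count(n), level_count(n))
```

For the four-input unit discussed in the text, the level count is 3, which matches the three-clock-cycle delay before output begins.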

Comparative analysis 

Kim et al. 5 presented the design of a parallel and pipelined
query processor in which tuples are processed in parallel, one
bit at a time. Relational database aggregation functions take
three steps (see Figure 10). The relation is first sorted, then fed
to an aggregation unit, and the resulting relation is sent to a
duplicate marker. Three different network topologies perform
statistical aggregation functions.

In our design, sorting, aggregation, and duplicate marking 
take place concurrently. One network topology and one type 
of processing element (slice processor) implement these 
functions. 

The parameters used in the comparative analysis are 

• k, the number of bits in the binary representation of the
Group-By values;

• p, the number of bits in the binary representation of the
aggregate values;

• r, the time (in seconds) to manipulate and pass one bit
to the neighboring slice processor; and

• n, the number of inputs in the Sum unit.

In our implementation, the processing of the reduced tuples
completely overlaps the input and output of the reduced
tuples to and from the Sum unit. Since the longest path in an
n-input Sum unit is $\log n(\log n + 1)/2$, processing n tuples
will take

$$A(n) = \left[\frac{\log n(\log n + 1)}{2} + (k + p + 1)\right] r \tag{2}$$


Notice that our approach allows for pipelined processing
of different streams of tuples (different relations). (The flags
of each slice processor in the first column should be
initialized by applying reset signal Sj after the input of each
relation completes.) Suppose that m relations, each composed of
n tuples, are to be processed by the Sum unit. The processing
of these m relations in a pipelined manner will reduce
the processing time from mA(n) to A(n) + (m - 1)(k + p + 1)r.
The processing time is therefore reduced by

$$(m - 1)\,\frac{\log n(\log n + 1)}{2}\, r \tag{3}$$

seconds.
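As a quick numerical check of Equations 2 and 3, the sketch below evaluates the unpipelined and pipelined processing times and confirms that their difference equals the stated saving. The function names are ours, and r is taken as one time unit purely for illustration.

```python
def longest_path(n):
    """Levels on the longest path of an n-input unit: log n (log n + 1) / 2."""
    log_n = n.bit_length() - 1  # n is assumed to be a power of two
    return log_n * (log_n + 1) // 2


def sum_time(n, k, p, r=1):
    """Equation 2: time to process n tuples in one pass."""
    return (longest_path(n) + (k + p + 1)) * r


def pipelined_time(m, n, k, p, r=1):
    """Time to process m relations of n tuples each in a pipelined manner."""
    return sum_time(n, k, p, r) + (m - 1) * (k + p + 1) * r


def pipeline_saving(m, n, r=1):
    """Equation 3: reduction relative to m separate passes."""
    return (m - 1) * longest_path(n) * r


if __name__ == "__main__":
    m, n, k, p = 10, 4, 16, 16
    unpipelined = m * sum_time(n, k, p)
    print(unpipelined, pipelined_time(m, n, k, p), pipeline_saving(m, n))
```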

The Sum algorithm we have described was an internal
one. That is, we assumed the number of reduced tuples to be
processed would be no larger than the number of inputs (n)
of the Sum unit. In general, an entire relation cannot be
processed internally by an odd-even network. When the number
of tuples is too large for a Sum unit to process internally,
an external algorithm is the most practical solution. An
external algorithm is one that allows a chip (or a set of chips) of
fixed size to process an input set of any size. 4 One approach
to this problem is based on an iterative use of a Sum unit of
fixed size.

Figure 11. A Sum unit with a Merge signal.

We based the proposed algorithm on successively merging
Summed sets of reduced tuples of increasingly larger size.
The external algorithm will use a Sum unit of fixed size
(n inputs) to process in an iterative manner a set of reduced
tuples whose number N is larger than n. Toward this end, we
added a new control signal called Merge to the Sum unit.
This signal allows an n-input Sum unit to have two operating
modes. In the first mode (Merge = 0), the unit performs the
Procedure Sum internal algorithm just illustrated. In the
second operating mode (Merge = 1) the unit merges two Summed
sets of size n/2 each in log n steps. Figure 11 shows a
four-input Sum unit with a Merge control signal.
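The two operating modes, and the iterative use of a fixed-size unit, can be illustrated with a purely functional model. This is a sketch under assumptions, not the hardware algorithm: reduced tuples are modeled as (Group-By value, aggregate value) pairs, a Summed set is modeled as a key-sorted list with one entry per key (the hardware marks duplicates with a bit instead of dropping them), and all function names are hypothetical.

```python
def sum_internal(tuples):
    """Merge = 0: sort reduced tuples by Group-By value and sum duplicates."""
    totals = {}
    for key, value in tuples:
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())


def merge_summed(left, right):
    """Merge = 1: merge two Summed sets, adding the values of equal keys."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        (kl, vl), (kr, vr) = left[i], right[j]
        if kl == kr:
            out.append((kl, vl + vr))
            i += 1
            j += 1
        elif kl < kr:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def external_sum(tuples, n):
    """Process more than n tuples by iterating a fixed-size n-input unit."""
    # First step: Summed sets of at most n tuples each.
    sets = [sum_internal(tuples[i:i + n]) for i in range(0, len(tuples), n)]
    # Second step: repeated pairwise merging until one Summed set remains.
    while len(sets) > 1:
        merged = []
        for i in range(0, len(sets), 2):
            if i + 1 < len(sets):
                merged.append(merge_summed(sets[i], sets[i + 1]))
            else:
                merged.append(sets[i])
        sets = merged
    return sets[0] if sets else []


if __name__ == "__main__":
    data = [("b", 2), ("a", 1), ("b", 3), ("c", 4),
            ("a", 5), ("d", 1), ("c", 1), ("a", 1)]
    print(external_sum(data, 4))
```

The model only captures what is computed; the timing of each step is the subject of the duration formulas in the text.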

The external algorithm is divided into two steps. During
the first step the Merge signal is reset to zero, and the Sum
unit generates N/n Summed sets of size n each. (Without loss
of generality, N is assumed to be a multiple of n.) When the
N/n sets are processed in a pipelined manner, the duration of
this step is

$$\left[\frac{\log n(\log n + 1)}{2} + \frac{N}{n}(k + p + 1)\right] r \tag{4}$$


In the second step, the Sum unit operates as a
[(n/2) x (n/2)] two-way merger by setting the Merge signal to 1.
The second step requires log(N/n) phases. During the ith
phase, $1 \le i \le \log(N/n)$, the unit converts $(1/2)^{i-1}(N/n)$
sets of $2^{i-1}n$ Summed reduced tuples to $(1/2)^{i}(N/n)$ sets of
$2^{i}n$ reduced tuples each. Pipelining the log(N/n) phases
yields the following duration:

$$\left[\frac{\log n(\log n + 1)}{2} + \sum_{i=1}^{\log(N/n)} \frac{2^{i+1} - 1}{2^{i}}\,\frac{N}{n}\,(k + p + 1)\right] r
= \left[\frac{\log n(\log n + 1)}{2} + \left(2\frac{N}{n}\log\frac{N}{n} - \frac{N}{n} + 1\right)(k + p + 1)\right] r \tag{5}$$

When pipelining is used between the two steps, the overall
duration of the external algorithm is

$$A(n, N) = \left[\frac{\log n(\log n + 1)}{2} + \left(2\frac{N}{n}\log\frac{N}{n} + 1\right)(k + p + 1)\right] r \tag{6}$$

In the VLSI query processor of Kim et al., tuples are
processed in a tuple-parallel, bit-serial fashion. Unlike our
approach, this processor performs aggregation functions on
presorted relations. An (n x n) sorter is used to sort n tuples
internally and can also be used as an (n x n) two-way merger.
Designing an (n x n) sorter requires n(3n - 1) bit-serial
processing elements. The longest path in the sorter consists of
(2n - 1) stages. The time needed to sort n tuples is
(k + p + 2n - 1)r.

This sorter requires 4n I/O pins, as opposed to 2n for our
Sum unit. The duration of merging two sorted lists of n tuples
each is the same as that of sorting n tuples. The time needed to
sort N > n tuples can be computed using an approach similar to
the one used to compute A(n, N). First we sort each of the N/n
sets of n tuples. Next, the unit is used as an (n x n) two-way
merger. As Kim et al. indicate, the overall duration of the
sorting is

$$\frac{N}{n}(k + p + 2n - 1)\left(1 + \log\frac{N}{n}\right) r \tag{7}$$

The sorted relation is then input to the aggregation unit.
An n-input aggregation unit composed of log n stages
requires n log n processing elements. Processing N tuples will
require about

$$\frac{N}{n}(k + p + \log n)\, r \tag{8}$$

The third step is the process of assigning a mark bit to
each reduced tuple so that duplicates can be removed. An
n-input duplicate checker composed of (2n - 1) PEs organized
in two stages completes this step. 5 Processing N tuples will
require about (N/n)(k + p + 3)r time.

The overall duration of performing Sum is thus

$$B(n, N) = \frac{N}{n}\left[(k + p + 2n - 1)\left(1 + \log\frac{N}{n}\right) + (k + p + \log n) + (k + p + 3)\right] r \tag{9}$$

The comparative analysis shown in Table 2 makes it clear
that the odd-even approach is significantly faster than the
approach in Kim et al. For example, when N/n = 2^10, the
acceleration is 1.69 for n = 32 and 2.85 when n is increased
to 64. Also, the performance gap decreases as the ratio
N/n increases. We attribute this fact to the query processor
sorter/merger, which requires 4n I/O pins, as opposed to 2n
for the Sum unit.

Table 2. B(n,N)/A(n,N) for p = k = 16 bits.

 n    N/n=2^4   N/n=2^8   N/n=2^10   N/n=2^12   N/n=2^14
16    1.50      1.20      1.16       1.12       1.10
32    2.05      1.75      1.69       1.65       1.62
64    3.50      2.96      2.85       2.78       2.72
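The ratio B(n, N)/A(n, N) can be recomputed directly from the two duration formulas. The sketch below encodes Equations 6 and 9 as we read them, with r set to one time unit (it cancels in the ratio); the function names are ours. Under that reading, the computed values for n = 32 fall within 0.01 of the corresponding Table 2 entries.

```python
def log2_exact(x):
    """Base-2 logarithm of a power of two, as an integer."""
    return x.bit_length() - 1


def odd_even_time(n, total, k, p, r=1):
    """A(n, N): external Sum algorithm on an n-input odd-even unit."""
    log_n = log2_exact(n)
    sets = total // n  # N/n, assumed to be a power of two
    longest = log_n * (log_n + 1) / 2
    return (longest + (2 * sets * log2_exact(sets) + 1) * (k + p + 1)) * r


def kim_time(n, total, k, p, r=1):
    """B(n, N): sort, aggregate, then mark duplicates."""
    log_n = log2_exact(n)
    sets = total // n
    sort = (k + p + 2 * n - 1) * (1 + log2_exact(sets))
    aggregate = k + p + log_n
    mark = k + p + 3
    return sets * (sort + aggregate + mark) * r


def speedup(n, sets, k=16, p=16):
    """B/A for a relation of N = sets * n tuples."""
    total = sets * n
    return kim_time(n, total, k, p) / odd_even_time(n, total, k, p)


if __name__ == "__main__":
    for exponent in (4, 8, 10, 12, 14):
        print(exponent, f"{speedup(32, 2 ** exponent):.3f}")
```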


OUR SPECIAL-PURPOSE, parallel bit-level, pipelined
processing unit supports relational database aggregate functions.
The architecture is based on the odd-even network topology,
and the system is composed of one type of simple slice
processor. Each slice operates on data, bit by bit, making the
design and verification of the circuit easy. The data processing
time is completely overlapped with the input and output
of data to and from the unit. The design is independent of
the tuple size, and since a bit-serial computation is used, the
system requires limited interconnection.

We designed and simulated a prototype four-input unit for
fabrication using discrete components. Currently, we are
designing and fabricating a VLSI prototype unit and studying
clock skew over long serial paths and the compatibility of
the design with RAM.


Acknowledgments

We thank all who helped us in this project. H. Mounaf
provided the design and simulation. Wayne Patterson, director
of the New Orleans Advanced Computation Laboratory,
gave us the opportunity to use the computing facilities of his
laboratory at various stages of the project. R. Loggins helped
prepare the manuscript, and the referees offered comments
and helpful discussions of the issues.

Parts of this article appear in the Proceedings of the Sixth
International Workshop on Database Machines dated June
1989 and the International Conference on Parallel Processing
dated August 1990. A grant from the University of New
Orleans Research Council and the NSF/Louisiana Stimulus for
Excellence in Research, EPSCoR Program under Grant
NSF/LaSER(1990)-RFAP-14 partly supported our work.


References

1. E.F. Codd, "A Relational Model of Data for Large Shared Data
Banks," Comm. ACM, Vol. 13, No. 6, pp. 377-387.

2. M.S. Giovanny, "Fragmentation: A Technique for Efficient Query
Processing," ACM Trans. Database Systems, Vol. 11, No. 2,
1986, pp. 113-123.

3. K.E. Batcher, "Sorting Networks and their Applications," Proc.
Spring Joint Comp. Conf., Vol. 32, Apr. 1968, pp. 307-314.

4. M. Abdelguerfi, S. Khalaf, and A.K. Sood, "Bit-Serial Parallel
Processing Unit for the Histogramming Operation," IEEE Trans.
Circuits and Systems, Vol. 37, No. 7, July 1990, pp. 948-954.

5. W. Kim, D. Gajski, and D.J. Kuck, "A Parallel Pipelined Relational
Query Processor," ACM Trans. Database Systems, Vol. 9, No. 2,
June 1984, pp. 214-242.

6. A.K. Sood, M. Abdelguerfi, and W. Shu, "Hardware Implementation
of Relational Algebra Operations," in Database Machines: Modern
Trends and Applications, NATO ASI, Series F, Springer-Verlag,
Berlin, 1986, pp. 321-380.
A Parallel, Scalable, Microprocessor-Based
Database Computer for
Performance Gains and Capacity Growth


The multiback-end database supercomputer, or MDBS, consists of the architecture and
performance of an experimental database computer and a number of database processors and
their corresponding database stores. The author relates two studies: one on the design goals
and architectural considerations of the microprocessor-based MDBS and the other on the
performance expectations and benchmark results in various loads and configurations.


David K. Hsiao 

Naval Postgraduate School 



The MDBS architecture is unique in that
each database processor is microprocessor-based
and has its own database store. These stores consist
of two different types of disk drives: smaller,
Winchester-type drives for paging and metadata (keys and
key-related data); and a larger, standard-type drive
for base data (database data). The multiple database
processor-store pairs, known as database
back ends, interconnect through a local area
network, with reliable point-to-point communications
(one processor to another) and unreliable
broadcast communications (one processor
to many others).

The interconnected database back ends interface
with another computer either directly via the
LAN or indirectly via the database controller. With
a direct interface, the database controller software
runs in another computer known as the front end
or server. With an indirect interface, the database
controller is a microprocessor-based computer that
uses either a tape station or built-in cassette tapes.

We use MDBS as a research vehicle in the
Laboratory for Database Systems Research at
the Naval Postgraduate School for the study of
the design and performance of parallel and scalable
database back ends. At present, MDBS can
be configured or scaled into a one-back-end,
two-back-end, and up to an eight-back-end database
computer for parallel operations and performance
analyses.


The performance gain of MDBS is unique in
that the response-time reduction of a transaction
is in direct proportion to the number of back
ends configured. For example, if we double the
number of back ends in a parallel configuration,
we reduce the response time of the same transaction
in the same database by almost half.
Although the database remains the same in both
configurations in this example, it must be redistributed
to induce parallel access to any new as
well as existing database stores. We use response-time
reductions to measure the performance gains
of the database computer compared to the
degree of its parallelism or the number of parallel
back ends used.

The growth capacity of MDBS is unique in that
the response-time invariance of a transaction may
be nearly upheld if the increasing degree of back-end
parallelism is in direct proportion to the growing
capacity of the database. In other words, if
we want to have nearly the same response time
for a transaction as its database doubles, we simply
double the number of parallel back ends.
Again, we must recluster and redistribute the
grown database on the new as well as existing
back ends to induce parallel accesses to all the
database stores. We use response time invariance
to provide nearly constant response times for the
same transactions, despite database growth. We
simply add parallel back ends proportionally.

Finally, in this article, we point out some of
the difficulties and examine some unfinished studies that may
have some impact on the architecture of parallel database
back ends in general, and MDBS in particular. We have learned
these lessons in the course of experimenting with this new
architecture. We hope our architectural design and performance
methodology will serve as a basis for the design and
construction of future parallel, scalable, and
microprocessor-based database computers.

The MDBS architecture 

Figure 1 depicts the MDBS architecture. The major building
block for the high degree of parallelism is the database
back end. To achieve a high degree of parallelism, we simply
add identical back-end hardware to the LAN, replicate
the back-end software and metadata on the new hardware,
recluster as necessary, and redistribute the
base data onto the new and existing database
stores. Let us elaborate on the major
building block first. We can then discuss the
rest of the architectural elements.

Database back ends. Eight database back 
ends are available in our laboratory. Although 
we are not specific in Figure 1, all eight back 
ends are identical. We therefore focus on the 
design of a single back end in the following 
sections. 

Hardware elements. Each database back 
end is a combination of a database processor 
and a database store. 


• CPU and internal data bus. Each database
processor consists of a Motorola
68020 microprocessor-based CPU with
internal 32-bit-wide data and control
buses. At 16.67 MHz, the 68020 is adequate
for database operations, since the
operations are mostly I/O-intensive
rather than computationally intensive.
Further, for any data transfer between
the real memory and external buses, we
use the VMEbus. The VME bandwidth is
15.5 Mbytes/s. On the other hand, the
real memory cycle of the database processor
is 240 ns per 32-bit word. We use
the following calculation:

4/240 bytes/ns = 4 x 10^9/240 bytes/s
= 16.67 x 10^6 bytes/s = 16.67 Mbytes/s.

Thus, since 15.5 Mbytes/s is below its
16.67-Mbytes/s capability, the maximum
amount of data that can be concurrently
transferred in (and out) of real memory
for processing over four buses is nearly covered by the
bandwidth of the VMEbus. The buses consist of three
for I/O (paging, metadata, and base data) and one
communications bus for the LAN. There are still some memory
cycles left (at the rate of 1.17 Mbytes/s) for the CPU
control bus to access the real memory.

• Backplane and communication bus. Not all four external
buses need to access the VMEbus simultaneously.
On those rare occasions when competition
among memory cycles for real-memory access is keen,
the CPU control bus always gives the right-of-way to
I/Os and communications. Further, the communications
bus, known as the backplane of the back end, is managed
by another microprocessor with its own buffer
memory, and also gives the right-of-way to I/O buses.
We discuss the backplane hardware later.

Figure 1. Architecture of MDBS.

• External I/O buses and the internal VMEbus. As
discussed, the I/O buses can monopolize the entire
bandwidth of the VMEbus. The microprocessor-based
CPU and real memory can thus sustain disk drives whose
maximum transfer rate is about 16 Mbytes/s.

• Disk drive types. Our system contains three disk drives; 
one is for paging, another for metadata, and the third for 
base data. Because the 68020 microprocessor supports 
address translation, a larger virtual memory (from 2 
Mbytes to 32 Mbytes) may be supported by a smaller, 4- 
Mbyte real memory. There is still, however, need for a 
paging disk for virtual memory pages. MDBS uses a 96- 
Mbyte Winchester-type disk drive for paging, because 
most of the operating and database system modules are 
real memory-resident and their pages are locked into 
the real memory. We use another 96-Mbyte disk drive 
for the metadata. You will recall that database metadata 
are keys and key-related information. Queries use keys 
to access clustered records directly. The number of keys 
used relates to the average size of the database clusters, 
which in turn relates to the number of back ends in the 
configuration. Finally, a large, standard, 500-Mbyte 
moving-head disk stores the base data. Here, the key-to- 
record ratio of metadata storage to base data storage is 
96:500, nearly 1:5 per back end. 

• Disk transfer rates. The transfer rate of each disk drive 
is under 2 Mbytes/s. Thus, with three drives transferring 
simultaneously in or out of real memory, the maximum 
rate is under 6 Mbytes/s. On the other hand, real memory 
can sustain a rate of some 16 Mbytes/s, so there is enough 
capacity for the addition of, for example, two more sets 
of metadata and base data disks. 

• Database store. For each back end there may be 288
Mbytes of metadata and 1.5 Gbytes of base data for its
database store. For the eight-back-end database computer
in our laboratory, there is a potential for some 12
Gbytes of database data. On the other hand, the metadata
are replicated. Thus, the real capacity of metadata remains
at 288 Mbytes. The key-to-record ratio drops from
1:5 in the one-back-end configuration to nearly 1:40 in
the eight-back-end configuration. To maintain the capacity
of metadata relative to the capacity of base data,
we may add more 96-Mbyte metadata disks and fewer
500-Mbyte base data disks. In other words, the database
store is scalable for direct access, depending on the capacity
of the metadata and base data. The database back
end is also scalable. In our laboratory, it spans from one
to eight.
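The bandwidth and capacity figures in the list above follow from simple arithmetic, sketched below. The constants are the values quoted in the text (with 15.5 Mbytes/s taken as the VMEbus bandwidth used in the comparison against memory bandwidth); the function names are ours and serve only as illustration.

```python
BYTES_PER_WORD = 4          # one 32-bit word per memory cycle
MEMORY_CYCLE_NS = 240       # real memory cycle time
VME_MBYTES_PER_S = 15.5     # VMEbus bandwidth used in the comparison
DISK_MBYTES_PER_S = 2.0     # upper bound on each drive's transfer rate
METADATA_DISK_MB = 96
BASE_DISK_MB = 500


def memory_rate_mbytes_per_s():
    """Real memory bandwidth: 4 bytes every 240 ns, in Mbytes/s."""
    bytes_per_second = BYTES_PER_WORD / (MEMORY_CYCLE_NS * 1e-9)
    return bytes_per_second / 1e6


def cpu_headroom_mbytes_per_s():
    """Memory bandwidth left for the CPU when the VMEbus is saturated."""
    return memory_rate_mbytes_per_s() - VME_MBYTES_PER_S


def max_disk_rate_mbytes_per_s(drives):
    """Upper bound on the combined transfer rate of the given drives."""
    return drives * DISK_MBYTES_PER_S


def store_capacity_mb(disk_sets):
    """(metadata, base data) capacity of one back end, in Mbytes."""
    return disk_sets * METADATA_DISK_MB, disk_sets * BASE_DISK_MB


def base_to_metadata_ratio(back_ends, disk_sets):
    """Total base data per unit of (replicated) metadata capacity."""
    metadata_mb, base_mb = store_capacity_mb(disk_sets)
    return back_ends * base_mb / metadata_mb


if __name__ == "__main__":
    print(round(memory_rate_mbytes_per_s(), 2))
    print(round(cpu_headroom_mbytes_per_s(), 2))
    print(max_disk_rate_mbytes_per_s(3))
    print(store_capacity_mb(3))
    print(round(base_to_metadata_ratio(1, 1), 1),
          round(base_to_metadata_ratio(8, 3), 1))
```

Because metadata are replicated rather than partitioned, only the base data term grows with the number of back ends, which is why the ratio widens from roughly 1:5 to roughly 1:40.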

Software processes. Since a database is considered as residing
in a back end, we use the Directory Management process
to manage metadata. We also use a process for managing
base data, the Record Processing process. In each back end,
transactions execute concurrently. Metadata and base data
processing of a transaction take place serially, and the metadata
processing of a transaction may overlap with the base data
processing of another transaction. And, since metadata processing
as well as base data processing for different transactions
may take place concurrently, we need a Concurrency
Control process.

A back end must be able to receive transactions for processing
and return responses. Thus, each back end has a pair
of processes for communications: the Get process, for getting
transactions or messages from the LAN, and the Put process,
for placing responses or messages on the LAN. Directory
Management, Record Processing, Concurrency Control, Get,
and Put are the only processes of a back end. These processes
are the same in every back end in a multiback-end
configuration. Further, these five database management and
communication processes use a Unix BSD 4.3 operating system
with TCP/IP protocols.

A microprocessor-based LAN. Through its Get and Put
processes, the microprocessor-based LAN is the only
communications link between the controller and back ends.

LAN hardware elements. These elements are an Ethernet
cable, the backplanes, and their transceivers. Each back end
contains a backplane that consists of an Intel 80186 microprocessor
and a small amount of real memory. The 128-Kbyte
real memory is used mainly for buffering incoming or outgoing
messages. It is also dual-ported to another microprocessor
(Intel 82586) that mainly executes low-level protocols,
which in turn support high-level TCP/IP protocols. These
protocols in turn support even higher protocols, such as UDP
(Ethernet’s user-defined protocol).

The Ethernet cable connects all the backplanes by way of
transceivers. It thus enables all 82586 microprocessors to work
in concert to execute a collision-resolution algorithm whenever
there is a collision or conflict of messages. One of the
messages is then selected for routing, while all other messages
are deferred momentarily and considered for routing
again. If, in the next routing, there is another collision, the
algorithm executes again. Eventually, all messages will be
routed to their destinations, so no messages are lost.

This algorithm causes restricted point-to-point (one backplane
to another backplane) routing. In other words, if, for
example, three point-to-point routings collide (say, from
Backplane 1 to Backplane 4, from Backplane 3 to Backplane 8,
and from Backplane 7 to Backplane 2), the three pairs route
one after another, although we cannot predict the routing
sequence. We note that, in this example, all backplanes involved
in the routing are disjoint, and no two pairs overlap
on an identical backplane. The Ethernet cable has a transmission
rate of 10 Mbps.

On the other hand, the 82586-based Ethernet does not
provide reliable broadcasting (one backplane to many
backplanes routing). For example, if Backplane 2 wants to
broadcast a message to all other backplanes, namely,
Backplanes 1 and 3 through 8, the routing takes place. However,
the message may not get to one or more of the other
backplanes, let’s say, Backplane 7. This kind of unreliable
routing is also unpredictable. Lastly, the 82586-based Ethernet
does not provide reliable multibroadcasting either.

LAN software protocols. Since the LAN’s own TCP/IP software
does not support reliable broadcasting and
multibroadcasting, we built another layer of protocols, UDPs,
for these operations.

At each back end, a logical socket for each and every other
back end is defined. Whenever a message reaches a particular
back end, that back end must acknowledge receipt. The
acknowledgment is deposited in its corresponding socket at
the sender’s back end. This send-receive-acknowledge procedure
via a socket is not necessary if the message routing is
point-to-point (one back end to another). In other words, the
acknowledgment is always there and can be ignored.




In broadcasting a message to other back ends, the UDP
software examines the acknowledgments deposited at the
sender’s back end. If there is no acknowledgment at a specific
socket, the program resends the message to the corresponding
back end’s socket by way of point-to-point routing.
Obviously, the receiver’s back end cannot begin its database
operation simultaneously and in parallel with other back ends
that had an earlier start on the same message. MDBS’s back
ends are not tightly coupled, do not share primary and virtual
memories, and have their own database stores. Thus, all
other back ends can proceed individually and in parallel without
any interference or delay from one or more identical
back ends.
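The acknowledge-and-resend scheme described above can be sketched as a small simulation. This is an illustrative model only, not the MDBS software: the class and function names are hypothetical, message loss is simulated with a random draw, and the point-to-point resend is assumed always to succeed, as the text states for point-to-point routing.

```python
import random


class BackEnd:
    """A receiver that records delivered messages and acknowledges them."""

    def __init__(self, ident):
        self.ident = ident
        self.inbox = []

    def receive(self, message):
        self.inbox.append(message)
        return True  # acknowledgment deposited in the sender's socket


def reliable_broadcast(receivers, message, loss_rate=0.3, rng=None):
    """Broadcast unreliably, then resend point-to-point where no ack arrived."""
    rng = rng or random.Random(0)
    # One acknowledgment slot ("socket") per receiving back end.
    sockets = {be.ident: False for be in receivers}

    # Unreliable broadcast: each receiver may or may not get the message.
    for be in receivers:
        if rng.random() >= loss_rate:
            sockets[be.ident] = be.receive(message)

    # Examine the sockets and resend over the reliable point-to-point path.
    resent = []
    for be in receivers:
        if not sockets[be.ident]:
            sockets[be.ident] = be.receive(message)
            resent.append(be.ident)
    return sockets, resent


if __name__ == "__main__":
    back_ends = [BackEnd(i) for i in (1, 3, 4, 5, 6, 7, 8)]
    acks, resent = reliable_broadcast(back_ends, "RETRIEVE ...")
    print(all(acks.values()), resent)
```

A back end that needed a resend simply starts later than the others; because the back ends share no memory or storage, the late start does not hold up the rest.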

The UDP software helps make unreliable broadcasting reliable.
Our investment in the software is minimal, because it
involves only a few memory locations for sockets. The use of
existing and reliable point-to-point protocols for resending
messages is overhead, but it does not interfere with other
back ends that require no resending. So, with reliable broadcasting,
we now also have reliable multibroadcasting, a communication
feature necessary to our back ends. The back ends
and controller access the UDP software with Get and Put.

Database controller. As stated previously, the controller
software may run in a general-purpose front-end computer
that interfaces with the database back ends via the LAN.
However, for research and experimental purposes, a dedicated
database computer like MDBS is preferred, because it does
not interfere with other program activities and programs in
the front-end computer.

Controller hardware. Like a back end, the controller hardware
is based on a 68020 microprocessor with address translation
capability, so we still need a paging disk. Unlike a
back end, the controller accesses neither metadata nor base
data. On the other hand, the controller is responsible for the
backup and recovery of the back end and LAN software.
Thus, we place a tape station at the controller. Backup and
recovery tapes for on-line and real-time backups and recoveries
can be readily made. We also plan to use tapes to load
a massive new database into back-end database stores. In
addition, we use tapes to load benchmarks for performance
analyses.

Controller software. Because the controller computer is the
sole interface with the “outside” world, it needs two processes:
one for receiving database transactions from the front
end and another for returning database results to the front
end. These processes are known as TP (Request or Transaction
Processing) and PP (Postprocessing).

A user request may be a new database transaction requiring
compilation, a canned transaction or macro in a transaction
library, or an on-line query requiring immediate response.
TP must be able to uniquely identify each request, preprocess
it, and broadcast the request to all back ends. Each back
end, on the other hand, places a broadcast request in its
request (or transaction) queue. The preprocessing of each
transaction amounts to a translation of the transaction into
one or more back-end primary database operations. We discuss
the five primary database operations, RETRIEVE,
DELETE, INSERT, UPDATE, and RETRIEVE COMMON, in a
later section on databases and their operations.

The PP process, on the other hand, is the postprocessing
of database records. For example, if a user desires the sum of
all the salaries from the salary database, each back end
returns a subsum of all the salaries from its portion of the
salary database. Then, PP adds all the subsums to produce
the overall sum that is then routed to the user. In this example,
PP performs the aggregate function, Sum. In general,
PP performs a large number of aggregate functions. By interfacing
with TP, PP can uniquely identify the correct user to
return the result or error message.
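The salary example amounts to a two-level aggregation: each back end reduces its own partition, and PP combines the partial results. A minimal sketch follows, assuming records are modeled as dictionaries and using hypothetical function names.

```python
def back_end_subsum(records, attribute):
    """Partial sum computed by one back end over its portion of the data."""
    return sum(rec[attribute] for rec in records if attribute in rec)


def postprocess_sum(subsums):
    """PP combines the subsums returned by the back ends."""
    return sum(subsums)


if __name__ == "__main__":
    # Hypothetical partitions of a salary database over three back ends.
    partitions = [
        [{"Salary": 100}, {"Salary": 250}],
        [{"Salary": 300}],
        [],
    ]
    total = postprocess_sum(back_end_subsum(part, "Salary") for part in partitions)
    print(total)
```

The same shape works for other aggregates that decompose over partitions, such as count, minimum, and maximum; an average would need each back end to return both a subsum and a count.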

TP and PP interface with the outside world. We use an
Insert Information Generator, or IIG, process designed solely
to assist a back end during an INSERT operation in the
“inside” world of the database computer. Of the five primary
back-end database operations, the INSERT operation is the
only one that is not a set operation. In other words, while
each of the other four primary operations works with a set of
records at a time, the INSERT operation works on a single
record at a time.

Upon insertion of a record, the particular cluster in which
the record belongs must be identified. However, the cluster
is evenly distributed over a number of back-end disk stores.
The database store that provides the latest available storage
space for any record insertion is identified in a space utilization
table maintained by the IIG. The space utilization table
keeps the following information up to date. For each cluster
of records, it identifies the back end whose database store
contains the first trackful of the cluster, and it identifies the
back end whose database store contains the last trackful of
records. It also identifies the back end whose database store
can provide the first available track for inserting the new
trackful of clustered records. The maintenance and allocation
by the IIG are based on a round-robin algorithm
explained later. In any case, with the help of the space utilization
table, the IIG instructs a specific back end to insert
records into its database store. The actual insertion is made
through the processes of the specific back end. Further, since
metadata appear in every back end, the specific back end
must also broadcast its metadata updates to all other back
ends for replication.

Like a back end, the controller also employs Get and Put
to communicate with one or more back ends. Thus, the Unix
BSD 4.3 system supports five processes (TP, PP, IIG, Get,
and Put) in the controller.

Architectural elements and system processes. In Figure 2,
we illustrate the relationship between the elements
and the system processes. Since all the architectural elements
and system processes in a back end are identical to the ones
in another back end, we depict those in only one back end.
We observe that, for communications, the Put process of one
computer sends messages to one or all Get processes in the
other computers. Thus, Put can facilitate either one-to-one or
broadcasting in its communication with the other computers
in the system.

Despite its name, the controller does not control and monitor
the back ends in their parallel database operations. The identical
parallel back ends are autonomous in that each picks up
a transaction from the head of its transaction queue, executes
the transaction against its own meta- and base data, and sends
out its results through its Put process. Except in the RETRIEVE
COMMON operation, a back end does not send any result to
another back end. Instead, the controller becomes the collector
of all the back-end results for postprocessing. It routes
final results to the user or front-end computer. It is possible
to eliminate the controller hardware and run the five controller
processes in a powerful front-end computer, provided
the front-end operating systems, like our Unix BSD 4.3, support
the process communications and TCP/IP protocols.



Figure 2. Architectural elements and their system processes.
(Legend: CC, Concurrency Control; DM, Directory Management;
IIG, Insert Information Generator; PP, Postprocessing;
RP, Record Processing; TP, Transaction Processing.)

Data organization and repertoire of operations

As a database computer, MDBS must have a data model to 
characterize its database and a data language to interpret 
elementary operations of a transaction. The data model used 
for MDBS is the attribute-based data model, or ABDM, and 
the data language supported by MDBS is the attribute-based 
data language, or ABDL. The rationale for using ABDM 
instead of a conventional data model, such as relational, is 
derived from the following intrinsic properties of ABDM. It 

• models metadata and base data separately; 

• introduces an equivalence relation, which partitions the 
database into mutually exclusive sets of base data called 
clusters; and 

• allows clustered records to be distributed in back ends,
which induces parallel access to database stores.

The rationale for implementing ABDL instead of a conventional
data language, such as SQL, derives from the following
intrinsic properties of ABDL. ABDL

• supports a parallel search algorithm based on predicates, 
and 

• is semantically rich and complete so that transactions 
written in conventional data languages like SQL may be 
translated into ABDL for execution in MDBS. 

These and other properties of ABDM and ABDL are dis¬ 
cussed later. 

Base data and metadata. Every piece of data in the data¬ 
base is characterized in ABDM as an attribute-value pair. Thus, 
when we refer to a piece of data, say, USA, in an example 
database, we not only refer to its value, USA, we also know 
its attribute, Country. An attribute-value pair is formally 
denoted as an ordered pair enclosed in angle brackets,
<Country, USA>. Attribute-value pairs are the building blocks 
of the database. 

Base data. A record is a set of attribute-value pairs such
that no two attribute-value pairs of a record have the same
attribute, and at least one attribute of a record is typed. The
first rule ensures that any attribute of the record is
single-valued. The second rule ensures that at least one key
identifies the record. A record is formally denoted as a set of
attribute-value pairs enclosed in parentheses: (<File, aircraft>,
<Classification, top-secret>, <Plane-Type, fighter>, <Radius,
799 nautical miles>, <Country, USA>, <Fuel-capacity, 600
gallons>, <Pilot-Grade, test-pilot>). All the records of the
database comprise its base data. In reality, a database holds
thousands, if not millions, of records.
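
The two record rules above can be sketched in a few lines of Python. This is only an illustration, not MDBS code: the function name `make_record` and the set `TYPED` are hypothetical, and values are simplified (units such as "nautical miles" are dropped).

```python
def make_record(pairs, typed_attributes):
    """Build a record (a dict) from (attribute, value) pairs.

    Rule 1: no two pairs of a record share an attribute.
    Rule 2: at least one attribute of the record is typed.
    """
    record = {}
    for attribute, value in pairs:
        if attribute in record:
            raise ValueError(f"attribute {attribute!r} appears twice")
        record[attribute] = value
    if not any(attribute in typed_attributes for attribute in record):
        raise ValueError("record has no typed attribute")
    return record


# Hypothetical set of typed attributes, taken from the sample Attribute table.
TYPED = {"Radius", "Plane-Type", "Country", "File", "Classification"}

aircraft = make_record(
    [("File", "aircraft"), ("Classification", "top-secret"),
     ("Plane-Type", "fighter"), ("Radius", 799), ("Country", "USA"),
     ("Fuel-capacity", 600), ("Pilot-Grade", "test-pilot")],
    TYPED,
)
```

A dictionary is a natural fit here because it enforces single-valued attributes by construction; the explicit check only turns a silent overwrite into an error.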

Metadata. There are three attribute types. A Type A
attribute is one whose values fall into disjoint value ranges.
Thus, a Type A attribute partitions its values into value
ranges. For example, the attribute Radius in Table 1 is Type A.
A Type B attribute is one of distinct values. For example, the
attribute Country in Table 1 is Type B, since it has two
distinct values, USA and USSR (shown in Table 2). A Type C
attribute is one whose distinct values are entered by the user
in real time. Thus, Type C attributes are like Type B attributes
in assuming distinct values. However, Type B attributes are
created at the time of database creation or at generation time.
Attributes and their types are collected in the Attribute table
(Table 1). For a real-world database, an attribute table
consists of tens, if not hundreds, of attributes. We illustrate
only five in Table 1.

We call each Type A attribute and its value ranges Type A 
descriptors. Similarly, we call Type B and Type C attributes 
and their distinct values Type B and Type C descriptors. If 
we relate the descriptors here to conventional keys, we have 
three different kinds of keys. Whereas both Type A and Type
B descriptors tend to remain the same, because of their
creation at database generation time, the Type C descriptors
may increase rapidly if a user enters records consisting of
many new values having the Type C attribute. All the
descriptors collect in a table known as a descriptor-to- 
descriptor-identifier table, or for short, the Descriptor table 
(Table 2). For a real-world database, a descriptor table con¬ 
sists of hundreds, if not thousands, of descriptors. In Table 2, 
we illustrate only 16. Specifically, Radius results in six Type A 
descriptors; Plane-Type, in three Type C descriptors; Coun¬ 
try, in two Type B descriptors; File, in one Type B descriptor; 
and Classification, in four Type C descriptors. 

Table 1. Attributes.

Attribute        Attribute type   DDIT entry
Radius           A                D11
Plane-Type       C                D21
Country          B                D31
File             B                D41
Classification   C                D51


Table 2. Descriptors.

ID    Descriptor
D11   0 ≤ Radius ≤ 400
D12   401 ≤ Radius ≤ 600
D13   601 ≤ Radius ≤ 800
D14   801 ≤ Radius ≤ 1,000
D15   1,001 ≤ Radius ≤ 1,200
D16   1,201 ≤ Radius ≤ 2,000
D21   Plane-Type = fighter
D22   Plane-Type = bomber
D23   Plane-Type = reconnaissance
D31   Country = US
D32   Country = USSR
D41   File = aircraft
D51   Classification = top-secret
D52   Classification = secret
D53   Classification = confidential
D54   Classification = unclassified


Let $D_i$ be the set of descriptors $D_{ij}$ for all
$j = 1, \ldots, j_i$. Then the product
$D_1 \times D_2 \times \cdots \times D_n$ induces an equivalence
relation whose classes partition the database data into mutually
exclusive sets of records, or attribute-value sets. These record
sets we term clusters.

Referring to our sample in the Attribute and Descriptor
tables, we note there are five attributes. Thus, $n = 5$. For
$D_1$, or Radius, we see six descriptors. Thus, for $i = 1$,
$j_1 = 6$. Similarly, for $i = 2$, $j_2 = 3$ for $D_2$ =
Plane-Type; for $i = 3$, $j_3 = 2$ for $D_3$ = Country; for
$i = 4$, $j_4 = 1$ for $D_4$ = File; and for $i = 5$, $j_5 = 4$
for $D_5$ = Classification. The cardinality of the product
$D_1 \times D_2 \times D_3 \times D_4 \times D_5$ is 144, since
$j_1 \times j_2 \times j_3 \times j_4 \times j_5 = 6 \times 3
\times 2 \times 1 \times 4 = 144$.
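
To make the counting concrete, the following Python sketch encodes the sample descriptors, computes the size of the product, and maps one record to the descriptor-identifier set that determines its cluster. It is an illustration under assumptions, not the MDBS implementation: the names `DESCRIPTORS`, `max_clusters`, `descriptor_id`, and `cluster_key` are hypothetical, the Radius ranges are taken as closed intervals, and the Country value follows the spelling in Table 2.

```python
from math import prod

# Sample descriptors keyed by attribute, then by descriptor identifier.
# A tuple is a closed value range (Type A); a string is a distinct value.
DESCRIPTORS = {
    "Radius": {"D11": (0, 400), "D12": (401, 600), "D13": (601, 800),
               "D14": (801, 1000), "D15": (1001, 1200), "D16": (1201, 2000)},
    "Plane-Type": {"D21": "fighter", "D22": "bomber",
                   "D23": "reconnaissance"},
    "Country": {"D31": "US", "D32": "USSR"},
    "File": {"D41": "aircraft"},
    "Classification": {"D51": "top-secret", "D52": "secret",
                       "D53": "confidential", "D54": "unclassified"},
}


def max_clusters(descriptors):
    """Cardinality of D1 x D2 x ... x Dn: the upper bound on clusters."""
    return prod(len(ids) for ids in descriptors.values())


def descriptor_id(attribute, value, descriptors=DESCRIPTORS):
    """Return the identifier of the descriptor that covers the value."""
    for ident, spec in descriptors[attribute].items():
        if isinstance(spec, tuple):
            low, high = spec
            if low <= value <= high:
                return ident
        elif spec == value:
            return ident
    return None


def cluster_key(record, descriptors=DESCRIPTORS):
    """Descriptor-identifier set that determines the record's cluster."""
    ids = set()
    for attribute in descriptors:
        if attribute in record:
            ident = descriptor_id(attribute, record[attribute], descriptors)
            if ident is not None:
                ids.add(ident)
    return frozenset(ids)


sample = {"File": "aircraft", "Classification": "top-secret",
          "Plane-Type": "fighter", "Radius": 799, "Country": "US"}
```

Because every typed value falls under exactly one descriptor per attribute, each record yields exactly one identifier set, which is why no record can land in two clusters.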

Potentially, this database may have up to 144 clusters, each 
of which is uniquely identified by a set of descriptor identifi¬ 
ers, with corresponding descriptors characterizing the cluster 
records. In reality, many clusters are empty, since there are 
no records in the database being interpreted by the descrip¬ 
tor identifier sets determining the clusters. MDBS does not 
track empty clusters. Nor does it record their descriptor iden¬ 
tifier sets in the Descriptor Table. The three kinds of identifi¬ 
ers are the 

1) system-generated record, which identifies the records of
a cluster characterized by a descriptor identifier set;

2) cluster, which identifies the cluster containing the records;
and

3) descriptor, which identifies the descriptor-identifier set
that determines the cluster.

These identifiers reside in the table known as the database
cluster definition table, or, for short, the Cluster table. In our
sample database, the Cluster table (Table 3) identifies 12
clusters of 33 records. Each cluster is defined by a unique set of
descriptors. For example, the first cluster, C1, is defined by
five descriptors whose identifiers are D13, D21, D31, D41,
and D51. Three records in that cluster have their own record
identifiers, R1, R2, and R13. Despite their differences, these
records all consist of top-secret aircraft data (defined by D51
and D41) about US fighters (identified by D31 and D21) having
a combat range of 601 to 800 nautical miles (D13). In
practice, a cluster table has tens, if not hundreds, of entries.


Table 3. Clusters.

ID    Descriptor ID set           Record ID
C1    {D13, D21, D31, D41, D51}   R1, R2, R13
C2    {D12, D21, D31, D41, D51}   R8, R9
C3    {D14, D21, D32, D41, D51}   R20
C4    {D13, D21, D32, D41, D51}   R17
C5    {D12, D21, D32, D41, D51}   R21, R24, R27, R30, R31
C6    {D13, D21, D31, D41, D52}   R3, R4, R14
C7    {D12, D21, D31, D41, D52}   R10, R11
C8    {D13, D21, D32, D41, D52}   R18, R19
C9    {D12, D21, D32, D41, D52}   R22, R23, R25, R26, R28, R29, R32, R33
C10   {D13, D21, D31, D41, D53}   R5, R6, R15
C11   {D12, D21, D31, D41, D53}   R12
C12   {D13, D21, D31, D41, D54}   R7, R16

The three database tables, Attribute, Descriptor, and Cluster,
contain the database metadata. Since the product
$D_1 \times D_2 \times \cdots \times D_n$ induces an equivalence relation, no record
identified in the Cluster table can belong to two different 
clusters. Furthermore, records in each cluster are unique. The 
equivalence relation is the foundation for our database paral¬ 
lel access and storage strategies. 

Distribution. Metadata and base data are distributed to their
separate database stores differently, although both are placed
so that their stores can be accessed in parallel.

Metadata. Since metadata is typically one or two orders
of magnitude smaller in size than base data, we decided to
replicate the metadata onto each of the back-end database
stores. We also use smaller, dedicated disk drives to store
the replicas. Consequently, all the back ends can access their
own metadata stores, and thus the same sets of Attribute,
Descriptor, and Cluster tables, in parallel.

Furthermore, parallel metadata accesses may be overlapped 
with the parallel base data accesses for any other transaction, 
since base data stores use separate and larger disk drives. 
Therefore, MDBS achieves concurrent and parallel execu¬ 
tions of transactions. 

Base data. Base data make up the bulk of a database. 
Therefore, they are not replicated for storage. Further, they 
are stored on each back end’s own high-capacity disk drives 
in a prescribed fashion. The method is as follows: 

1) from the Cluster table, the controller picks up a cluster 
identifier and its associated records; 

2) the controller blocks a variable-size cluster into fixed-
size trackfuls of records;

3) the controller determines the identifiers of the back ends, 
each of which can provide a track of available storage 
for the cluster identified (recall that the controller IIG 
has a storage utilization map to keep track of such 
information); 

4) the controller sends in parallel all the trackfuls of records 
to the back ends identified; 

5) each identified back end places its block of one or more 
trackfuls of clustered records into its base data store and 
enters identifiers of records stored onto the replicated 
Cluster table entry corresponding to the cluster on the 
metadata store; 

6) the controller then updates its space utilization map with
respect to this cluster; and

7) the entire procedure is repeated for all subsequent 
clusters. 

This turn-taking and one-track-per-back-end database
distribution (or redistribution) algorithm has a desirable effect
in that records of a cluster are evenly distributed (or
redistributed) over a set of separate, parallel database stores.
Subsequent accesses to the records of a cluster become parallel.
In other words, we have induced the record-parallel and
cluster-serial, or RPCS, operation for our database access operations.
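
The turn-taking placement described above can be sketched as follows. This is a simplified illustration, not the controller's actual code: the function name `distribute_cluster` and the dictionary `free_tracks` (standing in for the storage utilization map) are hypothetical, and the sketch assumes every cluster starts its turn with the lowest-numbered back end, which the text does not specify.

```python
def distribute_cluster(track_count, free_tracks):
    """Place the tracks of one cluster, one track per back end per turn.

    free_tracks maps a back-end identifier to its number of free tracks
    and is updated in place. The result lists, track by track, the back
    end chosen to store it.
    """
    placement = []
    order = sorted(free_tracks)
    while len(placement) < track_count:
        progressed = False
        for backend in order:
            if len(placement) == track_count:
                break
            if free_tracks[backend] > 0:
                free_tracks[backend] -= 1   # reserve one track
                placement.append(backend)
                progressed = True
        if not progressed:
            raise RuntimeError("no free tracks left on any back end")
    return placement


free = {0: 2, 1: 2, 2: 2}
three_tracks = distribute_cluster(3, free)
```

With three back ends and a three-track cluster, each back end receives exactly one track, so a later access to that cluster can read all three tracks at once.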

The choice of the cluster size is important in a parallel
access operation. Let the number of record tracks in a cluster
be m and the number of back ends n. Then, the maximum
number of record tracks for a cluster should be n, that is,
m ≤ n. To fine-tune the cluster size, we control the
number of descriptors used in the identification of the
clusters. In general, the more descriptors used, the smaller the
clusters become. If MDBS has many back ends and clusters of
few tracks, say, 3m = n, its back ends can access several
(here, three) clusters in parallel. Thus, we have induced
record-parallel and cluster-parallel, or RPCP, operation for our
database access operations. Figure 1 depicts both RPCS and RPCP
operations for a sample distribution of clusters of the database.
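
As a rough illustration of the relation between m and n, the sketch below reports how many clusters could be accessed at once under the simplifying assumption that each cluster occupies m distinct back ends and that clusters do not share back ends. The function name `access_mode` is hypothetical.

```python
def access_mode(backends, tracks_per_cluster):
    """Return (mode, clusters accessible in parallel) for n and m."""
    if not 1 <= tracks_per_cluster <= backends:
        raise ValueError("need 1 <= m <= n")
    clusters_in_parallel = backends // tracks_per_cluster
    mode = "RPCP" if clusters_in_parallel > 1 else "RPCS"
    return mode, clusters_in_parallel
```

For example, `access_mode(6, 2)` corresponds to the 3m = n case in the text.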

Database operations. A database transaction is composed
of one or more of the five primary database operations,
RETRIEVE, DELETE, INSERT, UPDATE, and RETRIEVE COMMON
(see Figure 3). Except INSERT, which inserts one new
record at a time into an existing database, these operations 
are set-oriented. More specifically, RETRIEVE, DELETE, and 
UPDATE operate on one set of records at a time, whereas 
RETRIEVE COMMON operates on two sets of records at a 
time. 

The query—a Boolean expression of predicates. The input 
for all set-oriented primary database operations is a query. 
The output of an operation is a set of records that has satis¬ 
fied the query and has been operated on. The query allows 
us to specify the properties of a record set without having to 
provide the record addresses of the set. These properties are 
specified in terms of a Boolean expression of predicates. A
predicate is a 3-tuple, consisting of an attribute, a relational
operator, and a value. The set of relational operators includes
=, <, >, ≤, and ≥. Boolean operators include ∧ (And) and ∨
(Or). The following is a query consisting of the conjunction
of two predicates (one less-than-or-equal-to predicate and
one equality predicate) and one disjunction of three equality
predicates.

((Radius ≤ 600) ∧ (File = aircraft) ∧ ((Classification
= secret) ∨ (Classification = confidential) ∨ (Classification
= unclassified)))

Transaction processing in the controller transforms all que¬ 
ries into their equivalent disjunctive normal forms for set- 
oriented primary operations. 
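
A minimal Python sketch of a query in disjunctive normal form follows. It represents a query as a list of conjunctions, each a list of (attribute, operator, value) predicates, and writes the sample query above in that form by distributing the conjunction over the three-way disjunction. The names `OPERATORS`, `satisfies`, `matches`, and `QUERY` are hypothetical, and ASCII `<=` and `>=` stand in for ≤ and ≥.

```python
import operator

# Relational operators named in the text.
OPERATORS = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
             "<=": operator.le, ">=": operator.ge}


def satisfies(record, predicate):
    """True if the record has the attribute and its value passes the test."""
    attribute, op, value = predicate
    return attribute in record and OPERATORS[op](record[attribute], value)


def matches(record, dnf_query):
    """A record matches if all predicates of at least one conjunction hold."""
    return any(all(satisfies(record, p) for p in conjunction)
               for conjunction in dnf_query)


# The sample query in disjunctive normal form: three conjunctions.
QUERY = [
    [("Radius", "<=", 600), ("File", "=", "aircraft"),
     ("Classification", "=", level)]
    for level in ("secret", "confidential", "unclassified")
]
```

Normalizing to this shape means the search only ever has to test flat conjunctions, which is what the predicate-by-predicate comparison described later relies on.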

Parallel predicate search algorithm. In Figures 4 and 5,
we depict our parallel search-for-records
algorithm on the basis of a given query. To simplify the
illustration, we use a simple query as a conjunction of two
predicates instead of the disjunctive normal form of many
predicates. Given a query, the back ends can determine the 
descriptor sets that satisfy the predicates of the query, since 
descriptors and predicates are similar in form. They do differ 
in where they appear. Descriptors are held in metadata or 
Attribute and Descriptor tables, and predicates in user que¬ 
ries. Once determined, we use these descriptor sets to com¬ 
pute the descriptor-identifier sets, which in turn allow us to 
determine the clusters characterized by them in the Cluster 
table. Thus, at the time the parallel search algorithm is being 
used, two input parameters have been given to the algo¬ 
rithm: the query and the cluster identifiers from which records 
will be searched for checking against the query. 

Let us observe the algorithm with the following sample
query in Figure 4:

1) The controller broadcasts the query to all the back ends
for execution.

2) Each back end uses the query to determine those clusters
that may have records satisfying the query. Each
back end’s Descriptor and Cluster tables aid this
determination. The cluster number and identifiers determined
for the query are now known by all back ends.

3) All back ends access one or more clusters of records.
Since clustered records are evenly distributed over the
back ends’ disks and the accesses are either RPCS or
RPCP operations, parallel database streams flow to the
back ends from their respective database stores.

4) Each back end does the following: It compares the
attribute of the first predicate of the query with the
attribute of the first attribute-value pair of the forthcoming
record. If the attribute matches, the back end checks
whether its attribute value satisfies the value of the
predicate dictated by the relational operator of the predicate.
If satisfied, it repeats the comparison and checks for the
next predicate and attribute-value pair until there are no
more predicates in the query. In this case, the record
satisfies the query and is routed as output. If any one of the
following cases takes place, the record is not output, and
the search begins with the next record in the stream. These
cases are: the first comparison for identical attributes
fails; the next comparison fails the relational operator
test; or the record under consideration runs out of
attribute-value pairs before the query runs out of predicates
for comparison.


RETRIEVE (query) [target-list] [by-clause]

DELETE (query)

UPDATE (query) (modifier)

RETRIEVE (query-one) [target-list-one]
COMMON (attribute-one, attribute-two)
RETRIEVE (query-two) [target-list-two]

INSERT (<attribute-first, value-first>, <attribute-next,
value-next>, ..., <attribute-last, value-last>, {any
arbitrary string})

Figure 3. Five primary operations.


(a) Record template for the Employees file

Field   Attribute
1       Employee number
2       Employee name
3       Department name
4       Salary
5       Vacation earned
6       Vacation used

(b) Records, with values listed by field number

Record   1    2       3            4       5    6
1.       10   Smith   Sales        10000   10   0
2.       15   Jones   Sales        30000   5    6
3.       21   Brown   Purchasing   22000   12   0
4.       35   Smith   Marketing    30000   9    2
5.       42   Smith   Purchasing   20000   8    5
6.       50   White   Purchasing   25000   0    0
7.       62   Gray    Marketing    10000   4    2
8.       71   Hall    Sales        15000   10   3
9.       75   Green   Sales        10000   8    9

Figure 4. Parallel search algorithm by predicates: record template (a);
comparison of bit serial and database stream parallel and single sweep (b).


The parallel search algorithm makes one sweep of all the
database streams coming from the database stores without
having to revisit records already swept. Our encoding
technique makes this possible. First, we use attribute
identifiers, as illustrated in the Attribute table, in lieu of
attribute names in both attribute-value pairs and predicates.
We then order both the attribute-value pairs and the query
predicates monotonically by their attribute identifiers.
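
The single sweep can be sketched as a merge-like pass over two sequences sorted by attribute identifier. This is an interpretation, not the MDBS predicate engine: the function name `sweep_match` is hypothetical, the sketch assumes at most one predicate per attribute, and it assumes that pairs whose identifier is smaller than the current predicate's are simply skipped, a detail the description above does not spell out. The sample data are the Employees records of Figure 4; treating Employee name as field 2 and Salary as field 4 in the query is likewise an assumption.

```python
import operator

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge}


def sweep_match(record, predicates):
    """Check a conjunction against a record in one forward pass.

    record: list of (attribute_id, value), ascending by attribute_id.
    predicates: list of (attribute_id, op, value), ascending likewise,
    with at most one predicate per attribute (an assumption).
    """
    pairs = iter(record)                 # never rewound: a single sweep
    for attr_id, op, wanted in predicates:
        for pair_id, value in pairs:
            if pair_id < attr_id:
                continue                 # assumed: skip unrelated pairs
            if pair_id > attr_id or not OPS[op](value, wanted):
                return False             # attribute absent or test failed
            break                        # predicate satisfied
        else:
            return False                 # record ran out of pairs
    return True


jones = [(1, 15), (2, "Jones"), (3, "Sales"), (4, 30000), (5, 5), (6, 6)]
smith = [(1, 10), (2, "Smith"), (3, "Sales"), (4, 10000), (5, 10), (6, 0)]
query = [(2, "=", "Jones"), (4, ">", 10000)]
```

Because both sequences are ordered the same way, the iterator over the record only moves forward, so no pair is examined twice.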

The effect of this algorithm is the cre¬ 
ation of a single-query/multiple-data- 
base-streams, or SQMD, operation. If, at 
a back end’s database store, there is no 
clustered data for a query, the back end 
chooses the next transaction in its trans¬ 
action queue to execute. The query of 
the new transaction is also new. Thus, 
we achieve a multiple-query/multiple- 
database-streams, or MQMD, operation, 
too. 

MQMD operations are the best we 
can achieve in parallelism for a parallel 
database computer. However, it should 
be observed that the letter “Q” in SQMD 
and MQMD replaces the letter “I” (instruction) in SIMD and 
MIMD, which are the two best parallel operations in a 
supercomputer. Note also that the letter “D” in SQMD and
MQMD differs from the letter “D” in SIMD and MIMD. Our
“D” stands for database streams coming from their respective
base-data disks, while the classical “D” stands for data items
coming from main memories. To sustain parallel database
streams, we cluster data and distribute the clustered data evenly
across multiple disk stores. Our parallel algorithms achieve
SQMD and MQMD, as well as RPCS and RPCP operations over
the database.

RETRIEVE COMMON. As mentioned earlier, RETRIEVE,
DELETE, and UPDATE are three database primary operations,
each of which operates on one set of records at a time.
A user query specifies the set of records to be operated on.
On the other hand, RETRIEVE COMMON operates on two
sets of records, each of which is specified by a query. Thus,
a RETRIEVE COMMON operation involves two queries.
Essentially, this operation retrieves two sets of records and
outputs those records that have common attribute values. It is
equivalent to a relational equijoin operation whose complexity
is about the same as that of a tape-oriented merge of two
files. However, in a parallel architecture using a broadcasting
LAN, we need a unique algorithm.

[Figure 5 labels: (a) 4, =, Jones, and, >, 10000; (b) database
store (disks); another database store (disks).]

Figure 5. Conjunction of an equality predicate and a greater-than predicate (a); and predicate engine (b).

The RETRIEVE COMMON algorithm for multiple back ends 
interconnected by a broadcasting LAN is as follows: 

1) In SQMD mode, each back end retrieves its first subset
of records on the basis of the first query given.

2) For each record retrieved, each back end identifies the
common attribute specified in the primary operation,
hashes its attribute value into a virtual memory address,
stores the record in the virtual memory so addressed,
and repeats this step until all the records retrieved have
been hashed into the back end’s virtual memory.

3) In SQMD mode, each back end then retrieves the second
subset of records on the basis of the second query
given.

4) For each record retrieved, each back end identifies the
common attribute as specified in the primary operation;
hashes its attribute value into a virtual memory address;
fetches all records with the same virtual address, if any,
from the virtual memory; compares its attribute value
with the attribute value of the records fetched; outputs
both records when the values match; and repeats this
step until all the records of the second subset have been
retrieved and processed.

5) Each back end then broadcasts its second subset of
records to all the other back ends.

6) For each record received via broadcasting, each back
end repeats step 4. When step 4 is finally completed for
all records, the operation ends.

We must note that, although there are two sets involved in 
the operation, only the second set was broadcast. If there are 
n back ends, we have, at most, n mutually exclusive subsets 
of the second set, each of which is broadcast to the other (n 
- 1) back ends. The load of the broadcast LAN is therefore 
the cardinality of the second set, that is, the sum of the cardi¬ 
nalities of its subsets. 
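
The six steps can be simulated in one process as below. This is a sketch under assumptions, not the MDBS implementation: a Python dictionary stands in for hashing records into virtual memory, the broadcast LAN is modeled by a loop over all back ends, and the names `retrieve_common` and `lan_load` are hypothetical.

```python
from collections import defaultdict


def retrieve_common(first_subsets, second_subsets, attr_one, attr_two):
    """Simulate RETRIEVE COMMON over several back ends.

    first_subsets[i] and second_subsets[i] are the records that back
    end i retrieved for the first and second query, respectively.
    Returns the matching record pairs and the number of records that
    were put on the broadcast LAN.
    """
    # Steps 1-2: each back end hashes its first subset on the common
    # attribute (the dictionary plays the role of virtual memory).
    tables = []
    for subset in first_subsets:
        table = defaultdict(list)
        for record in subset:
            table[record[attr_one]].append(record)
        tables.append(table)

    results = []
    lan_load = 0
    for subset in second_subsets:
        # Step 5: each second subset is broadcast once.
        lan_load += len(subset)
        # Steps 3-4 (local back end) and step 6 (all other back ends):
        # probe every hash table with every record of the subset.
        for table in tables:
            for record in subset:
                for match in table.get(record[attr_two], []):
                    results.append((match, record))
    return results, lan_load


first = [[{"k": 1, "a": "x"}], [{"k": 2, "a": "y"}]]
second = [[{"j": 2, "b": "p"}], [{"j": 1, "b": "q"}, {"j": 3, "b": "r"}]]
pairs, load = retrieve_common(first, second, "k", "j")
```

Only the second set travels over the LAN, so `load` equals the total number of second-set records, in line with the observation above.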

MDBS performance

We measure MDBS performance in two ways: response 
time reduction, or RTR, and response time invariance, or RTI. 

If the response time of a transaction is too long for our 
liking, we may add more database back ends, replicate both 
the system software and metadata on the new hardware, and 
redistribute the same base data onto existing and new stores. 
Ideally, the response time is inversely proportional to the
number of back ends. For example, if we double the
number of back ends for a transaction in a database, we
would like the response time to be cut in half. This is
the RTR.

If the response time gets longer as the database grows in 
size, we may add more hardware, replicate the software and 
metadata as above, and redistribute the increased base data 
onto the existing and new hardware. Ideally, the response 
time of the transaction will remain the same as long as the 
new hardware added to the configuration is in proportion to 
the growth of the database, say, a doubling of the number of 
back ends for every doubling of database size. This is the 
RTI. See Demurjian 12 for formal definitions and formulas of 
RTR and RTI. 

Benchmark transactions. As an example, let’s use a set 
of benchmark transactions and a sizeable benchmark data¬ 
base to illustrate RTR measurements. The database is redis¬ 
tributed on a variety of back-end configurations, ranging from 
the baseline configuration (one-back-end), to a highly paral¬ 
lel configuration (eight-back-end). (Refer to Figure 1.) 

Let’s also introduce a set of benchmark databases whose 
sizes are proportional to storage capacities and the number 
of back ends. We use the set of benchmark databases and 
the same set of benchmark transactions to conduct RTI mea¬ 
surements over eight configurations as well. There are two 
kinds of benchmark transactions: overhead-intensive and data- 
intensive. 

Overhead-intensive transaction. This transaction tends to 
access a small portion of a database and perform intended 
database primary operations on even smaller portions of 
accessed data. Let’s say that an overhead-intensive transac¬ 
tion (Transaction 1) is a RETRIEVE operation that accesses 
only 4 percent of the benchmark database, checks them against 
the query, and outputs only those satisfying the query. The 
output is about one half the data accessed, or 2 percent of 
the database. Let’s use another overhead-intensive transac¬ 
tion (Transaction 6), a DELETE operation. Like Transaction 1, 
it accesses 4 percent of the database and qualifies one half, 
or 2 percent of the data accessed. Unlike Transaction 1, though, 
it deletes those qualified 2 percent from the database. 

Data-intensive transaction. This transaction tends to 
access a large portion of the database and perform the 
intended operation on a relatively small portion of the 
accessed data. Let’s consider three RETRIEVEs that are
data-intensive. Transaction 2 accesses 26 percent of the database,
of which 96 percent of the records (25 percent of the
database) satisfy the query. Transaction 3 accesses 50 percent of
the database, of which one half of the records (25 percent of
the database) satisfy the query. Transaction 4 accesses 100
percent of the database, of which one half of the records (50
percent of the database) satisfy the query. Two DELETEs are
also data-intensive. Transaction 5 accesses 50 percent of the
database, with one half of the records (25 percent of the 
database) being deleted. Transaction 7 accesses 100 percent 
of the database, with one half of the records (50 percent of 
the database) being deleted. 

We have now proposed seven benchmark transactions, 
where two are overhead-intensive and five are data-inten¬ 
sive. (Typical database operations are mostly data-intensive.) 
We have restricted these seven transactions to two primary 
database operations, RETRIEVE and DELETE. The three re¬ 
maining primary database operations, INSERT, UPDATE, and 
RETRIEVE COMMON, are not considered for benchmarking. 

Because INSERT is not a set operation, it cannot take ad¬ 
vantage of the parallelism of our database computer. There 
should be no response time reduction or response time in¬ 
variance over the conventional sequential database computer. 

UPDATE can be viewed as a series of three operations: 
RETRIEVE a set of records for updating, DELETE the original 
set of records from the database stores, and INSERT one at a 
time all newly updated records. 

Because RETRIEVE and DELETE have been included in 
the benchmark study, and INSERT has been left out, there is 
no need to benchmark the update function. 

Finally, we have also decided not to benchmark RETRIEVE 
COMMON, which is not only a set operation but also 
involves two sets of records. At present, RETRIEVE COM¬ 
MON works well with two small sets of records. They are too 
small to accommodate the sizes of our two benchmark record 
sets. The size limitations are not attributed to the reliability of 
the broadcast protocols, nor are they attributed to the paral¬ 
lel and broadcast algorithms of RETRIEVE COMMON out¬ 
lined above. They are attributed to the microprocessor-based 
backplanes of the Ethernet. We discuss hardware issues later. 

Benchmark databases. The choice of database, cluster, 
and record sizes for our benchmark study is the first step 
toward the creation of the benchmark databases. 

Multiplicative factor. For the RTR study the same database
is benchmarked over one to eight back ends. Thus, the
database size must be divisible by the number of back ends in
each configuration. We must choose, therefore, the lowest
common multiple of the back-end numbers as the factor for the
database size. The lowest common multiple of 1, 2, 3, 4, 5, 6,
7, and 8 is 840. For the RTI study we replicate the same
database on additional back ends. Thus, for this study the total
database size is the number of back ends times the size of a
database.

Record sizes. There are three record sizes: 1,000 bytes for a
large record, 500 bytes for a medium-size record, and 100
bytes for a small-size record. There are three typed attributes:
one Type A and two Type B. The attribute values take only
a small number of bytes in each record. The rest of the record
consists of filler attribute values whose attributes are not typed
or kept in the metadata tables. The value ranges of the Type
A attribute and the distinct values of the two Type B
attributes are carefully chosen to induce the following cluster sizes.

Cluster sizes. For convenience in our benchmark effort, we
chose the same total byte size for each of the four cluster-size
classes, although the record numbers in these classes
differ. The four classes are large (1,000-byte records),
medium-large (500-byte records), medium (200-byte records),
and small (100-byte records). Each class contains 1.68
Mbytes in total.

More specifically, there are 140 large clusters, ranging from 
four 1,000-byte records per cluster to twenty 1,000-byte records 
per cluster. Together, these 140 clusters contain 1,680 large 
records, although some contain as few as four large records 
each, and others contain as many as 20 records each. Simi¬ 
larly, 140 medium-large clusters range from eight 500-byte 
records per cluster to forty 500-byte records per cluster. The 
medium-large clusters contain, collectively, 3,360 medium- 
large records. There are 140 medium clusters, ranging from 
twenty 200-byte records per cluster to one hundred 200-byte 
records per cluster. The medium clusters collectively contain 
8,400 medium records. Lastly, 140 small clusters range from 
forty 100-byte records per cluster to two hundred 100-byte 
records per cluster. The small clusters contain, collectively, 
16,800 small records. 

Database sizes. The size of the database is the sum of the
sizes of all the records in all the clusters (4 × 1.68 Mbytes =
6.72 Mbytes). For the RTR study, this figure must be a multiple
of the factor 840. Since 6.72 Mbytes = 840 × 8,000 bytes, we
can redistribute the same 6.72-Mbyte database evenly over two-,
three-, four-, five-, six-, seven-, or eight-back-end
configurations.

For the RTI study, we simply replicate the 6.72-Mbyte
database n times for any n-back-end configuration. Thus, for
the eight-back-end configuration, the benchmark database is
53.76 Mbytes (6.72 Mbytes × 8).
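
The sizing arithmetic above can be checked with a few lines of Python. The record counts are the ones quoted in the text; the variable names are hypothetical, and 1 Mbyte is taken as 1,000,000 bytes, which is what the article's own figures imply.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Lowest common multiple of two positive integers."""
    return a * b // gcd(a, b)


# Lowest common multiple of the back-end counts 1 through 8.
FACTOR = reduce(lcm, range(1, 9))

# Cluster-size class -> (record size in bytes, total number of records).
CLASSES = {
    "large": (1000, 1680),
    "medium-large": (500, 3360),
    "medium": (200, 8400),
    "small": (100, 16800),
}

class_bytes = {name: size * count for name, (size, count) in CLASSES.items()}
database_bytes = sum(class_bytes.values())
total_records = sum(count for _, count in CLASSES.values())
rti_bytes_eight_backends = 8 * database_bytes
```

The check confirms that each class totals 1.68 Mbytes, that the database holds 30,240 records in 6.72 Mbytes, and that this size is an exact multiple of 840.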

Benchmark results. We performed separate RTR studies
on all eight configurations by redistributing the same
database for the five data-intensive and two overhead-intensive
transactions (minimally, eight database loads for 56
transaction runs). We summarize the results in Figure 6.
Since there are seven benchmark transactions, there are seven
RTR graphs in Figure 6. We have also completed separate
RTI studies on eight configurations by replicating the
database seven times for the five data-intensive and two
overhead-intensive transactions (minimally, another eight database
loads and another 56 transaction runs). Figure 7
shows the RTI statistics and graphs.

RTR results on data-intensive operations. Figure 6a shows
the graph for Transaction 2, which accesses 96 percent of the
30,240 records in the 6.72-Mbyte database and selects 26
percent (7,560 records of various sizes) of the records retrieved as
its output. It takes about 12 seconds to complete the RETRIEVE
operation in the baseline configuration. As the same database
is distributed to the two-back-end configuration evenly, we
expect the ideal response time for Transaction 2 to be about 6
seconds. Since we doubled the back-end number, we expect
the response time to be cut in half. Instead, the measured
response time is about 7 seconds. Thus, the overhead incurred
with two parallel back ends is about 1 second.

As we increase the back-end number and redistribute
the same database correspondingly, so that each back end holds
one third, one fourth, one fifth, one sixth, one seventh, or
one eighth of it in each new configuration, the measured
performance curve continues to fall in Figure 6a. This
indicates that the additional back ends cut down response
time. Furthermore, the difference between the measured
response time and the corresponding ideal response time remains
about the same, about 1 second, for all configurations. In other
words, this kind of parallelism does not increase the overhead.
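
The comparison against the ideal curve amounts to the following arithmetic, shown here as a small Python sketch. The function names are hypothetical, the formula is the simple inverse-proportional ideal described earlier rather than the formal RTR definition cited in the article, and the numbers are the approximate ones quoted for Transaction 2.

```python
def ideal_response_time(baseline_seconds, backends):
    """Ideal response time: the baseline divided by the number of back ends."""
    return baseline_seconds / backends


def parallel_overhead(measured_seconds, baseline_seconds, backends):
    """Gap between the measured time and the ideal time."""
    return measured_seconds - ideal_response_time(baseline_seconds, backends)


# Transaction 2: about 12 s on one back end, about 7 s measured on two.
ideal_two = ideal_response_time(12, 2)
overhead_two = parallel_overhead(7, 12, 2)
```

A roughly constant overhead across configurations is what makes the measured curve run parallel to the ideal one.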

In Figure 6b for Transaction 3, Figure 6c for Transaction 4, 
Figure 6d for Transaction 5, and Figure 6e for Transaction 7, 
the plotted curves all slide downward. The plotted curves 
again closely parallel the ideal curves, regardless of the num¬ 
ber of back ends configured. These figures indicate that use 
of a multiplicity of back ends to reduce the response time of 
data-intensive operations is a promising approach. The curve 
has not yet leveled off at the eight parallel database proces¬ 
sor-store pairs. We may approach a higher degree of paral¬ 
lelism beyond eight for more reductions in response time. 

RTR results on overhead-intensive operations. In Figures 6f 
and 6g, we show the measured RTR statistics on Transac¬ 
tions 1 and 6. As overhead-intensive operations, these two 
transactions access only a small amount (248.8 Kbytes) of 
records data (4 percent of the 6.72-Mbyte database) from the 
database stores. However, they require a relatively large amount
of in-memory processing (50 percent of the records retrieved 
for qualification or deletion). As we can see from the graphs, 
the curves level off at the five-back-end configuration. Thus, 
for overhead-intensive operations, there is no appreciable 
reduction in response time beyond this degree of parallel¬ 
ism. In other words, overhead-intensive operations do not 
require a highly parallel database computer. Instead, we 
should improve the performance of individual back ends by
using processor types such as RISC processors.

RTI results on data-intensive operations. In Figure 7a, we see that
all five data-intensive operations, Transactions 2, 3, 4, 5, and 
7, exhibit relatively level curves in all eight configurations, 
that is, from the baseline to the eight-back-end configura¬ 
tion. This indicates the response time of each transaction 
remains unchanged, or invariant, despite a change in data¬ 
base size from the baseline 6.72 Mbytes to 53.76 Mbytes in 
the eight-back-end configuration. In other words, even though 
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Figure 6. Response-time reductions over the multiplicity of parallel back ends for transactions 2 (a), 3 (b), 4 (c), 5 (d), 7 (e), 
1 (f), and 6 (g).


the transaction must operate on more data, the response time 
of the transaction remains invariant, as long as the multiplic¬ 
ity of the parallel back ends is proportional to the increase in 
database size. 


Transaction 7 (a DELETE operation) almost doubles the 
response time of Transaction 4 in every configuration. This is 
not surprising. Both access the same amount of data (the 
entire database). However, Transaction 7 must not only 
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Figure 7. Response-time invariances over the multiplicity 
of parallel back ends for data intensive transactions (a) 
and for overhead-intensive transactions (b). 

select 50 percent of the records accessed as does Transaction 
4 (a RETRIEVE operation) but must also delete the selected 
records from the database stores. The curves indicate that it
takes some 40 seconds for each back end to access and process
30,240 records of varying sizes, and it takes some 30
seconds more for each back end to tag 15,120 records for 
deletion and return them to their database stores for later 
removal. 

Transaction 5 (another DELETE operation) and Transac¬ 
tion 3 (a RETRIEVE operation) have almost identical response 
times for all configurations. This is also not surprising, since 
they both access the same half of the database and select half 
of the records retrieved (for deletion in the case of Transac¬ 
tion 5, or for returning data in the case of Transaction 3). But, 
when the amount of accessing is relatively small and the 
amount of processing is again relatively small, the times for 
writing the smaller number of deletion tags onto database 
stores are mostly overlapped by the accessing and process¬ 
ing times of the other records. 

RTI results on overhead-intensive operations. Figure 7b contains
the response times for the overhead-intensive operations
in Transactions 1 and 6. Both curves exhibit a similar
zigzag pattern. This indicates that, as the benchmark data¬ 
base is doubled, tripled, quadrupled, quintupled, sextupled, 
septupled, and octupled for the different configurations, the 
data has not been evenly redistributed on the corresponding 
configurations. The back end that has the highest number of 
records in a multiback-end configuration has the longest re¬ 
sponse time. Nevertheless, the deviation is within only one- 
half second in Transaction 1, and within 1 second in 
Transaction 6. It is important that, for either transaction, the 
response times deviate upward and downward from the norm. 
The norm for Transaction 1 is about 2.25 seconds. The norm 
for Transaction 6 is about 2.75 seconds. 

MDBS limitations 

There are several hardware and software limitations, but 
the prospects of MDBS-like database computers will be great 
if we can overcome the limitations. 

Hardware. The hardware limitations seem to be related 
to the microprocessor-based Ethernet. A dual-ported, 124- 
Kbyte real memory supports two microprocessors in a back¬ 
plane. This real memory also buffers the messages coming 
from and going to the Ethernet cable. Since messages are 
typically small and infrequent, the real memory size is ad¬ 
equate for buffering messages. Further, messages are sent 
mostly via the point-to-point communication protocol, which 
cannot cause a collision. 

For buffering records, though, the size of the real memory 
is inadequate, since records are individually large, typically 
hundreds of bytes per record, and are transmitted in bulk, 
typically tens, if not hundreds, of records in a batch transmis¬ 
sion. The buffer, therefore, overflows, and some of the records 
are overwritten by records arriving later. This requires the 
sender to repeat the send protocol containing the last batch 
of records. Repetitions of this type markedly slow the broad- 
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cast operations in the second stage of a RETRIEVE COM¬ 
MON operation where each back end broadcasts its subset 
of the second set of records to all the other back ends. As 
one or more back ends must repeat the broadcast opera¬ 
tions, the delay and traffic become intolerable. This, of course, 
severely slows RETRIEVE COMMON. 

Larger buffer memories and interrupt mechanisms. We 
make two recommendations for the purpose of overcoming 
this hardware limitation: provide a larger, dual-ported buffer 
memory in each backplane and provide an interrupt mecha¬ 
nism. Whenever the buffer memory is full, the backplane 
microprocessors interrupt the CPU (in our case, the back¬ 
end computer’s 68020 microprocessor) so the CPU can begin 
the Get process to receive the records in the buffer right 
away. Without this interrupt hardware, there is no quick way 
to alert the back-end computer’s CPU when the buffer is full 
and ready to forward the records. Meanwhile, Get, the sole 
process for receiving records or messages from the Ethernet 
for other processes, may be put on standby, while the CPU 
executes another process. 

Higher bandwidth communication network. The Ethernet
has a bandwidth of 10 Mbps. This is adequate for all the 
database primary operations, except RETRIEVE COMMON, 
where the bandwidth must sustain the cardinality of the sec¬ 
ond set of records in the broadcast mode. Say the cardinality 
is 25,000 records, and the average record size is 750 bytes. 
Then eight subsets of the second set of records of 18.75 Mbytes 
are broadcast at about the same time over the Ethernet. Not 
counting the delay due to eightfold collisions during the pe¬ 
riod of multibroadcasting, the present bandwidth requires at 
least 15 seconds to complete the transmission. Assuming that 
it takes 2 seconds to resend one eighth of the record set 
when a collision occurs, a minimum of 16 seconds are needed 
for the second set. This is also too slow. 
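
The arithmetic behind the 18.75-Mbyte and 15-second figures can be checked with a short calculation. This is a sketch under stated assumptions: 1 Mbyte is taken as 10^6 bytes, the full 10 Mbps is treated as usable payload bandwidth, and protocol overhead and collisions are ignored. The names are illustrative.

```python
RECORDS = 25_000                      # cardinality of the second record set (from the text)
RECORD_BYTES = 750                    # average record size (from the text)
ETHERNET_BITS_PER_SEC = 10_000_000    # 10 Mbps Ethernet (from the text)


def broadcast_seconds(records: int, record_bytes: int, bits_per_sec: int) -> float:
    """Lower bound on transmission time: payload bits divided by raw bandwidth."""
    return records * record_bytes * 8 / bits_per_sec


total_mbytes = RECORDS * RECORD_BYTES / 1_000_000
seconds = broadcast_seconds(RECORDS, RECORD_BYTES, ETHERNET_BITS_PER_SEC)
print(total_mbytes, "Mbytes need at least", seconds, "seconds")
```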

Software limitations. Software limitations are related 
mainly to INSERT, UPDATE, and RETRIEVE COMMON, three 
of the five database primary operations. 

Massive loading of a new database. For our benchmark 
effort, we created benchmark databases. Each database is a 
maximum 6.72 Mbytes per back end. We used INSERT to 
create each database by adding the records one at a time. 
This process is, of course, very time-consuming. It took us 
days to create one database and weeks to create all the nec¬ 
essary databases. We needed a massive loading program that 
could read the attribute-value pairs generated from a record 
generator, format them into records, extract their typed at¬ 
tributes, form their attribute table, and construct the descrip¬ 
tors from the typed attributes. We also needed it to form the 
descriptor table, determine the clusters of all records, form 
their cluster table, load the attribute, descriptor, and cluster
tables onto the metadata disks, and load clustered records 
evenly onto base data disks one cluster of records at a time. 

Control of virtual-memory space. In both UPDATE and
RETRIEVE COMMON, each back end must first select a set of
records to be processed. Regardless of its size, this record set 
arrives from the database stores and temporarily resides in 
the virtual memory, awaiting subsequent operations. Typi¬ 
cally, we use hashing techniques to quickly come up with 
addresses for such temporary storage. Unfortunately, the 
addresses are virtual—not real—addresses. Consequently, 
some or nearly all of the records in the temporary storage 
page out of real memory. It takes many paging (I/O) opera¬ 
tions to retrieve the records for the next processing step in 
real memory. This delays or slows down both the UPDATE 
and RETRIEVE COMMON operations. 

We make the following recommendations here for the 
purpose of overcoming this limitation: 

• The operating system provides us with a system func¬ 
tion that enables us to lock a certain number of virtual 
memory pages into real memory. So, as long as we use 
these locked pages for the temporary storage of our 
records, there will be no paging operations involved. 

• On the basis of the size of available locked pages, we 
can partition either an UPDATE or a RETRIEVE COM¬ 
MON into a series of updates (in which each update 
works on a new batch of records in the locked pages) 
or a series of RETRIEVE COMMONS. For example, in 
the first set of records each RETRIEVE COMMON works 
on a new batch of records in the locked pages. In other 
words, each batch of the first record set is processed 
against the entire second record set, although the sec¬ 
ond record set can also be brought in for processing 
one batch at a time (as long as both batches can be 
accommodated in the available locked pages). 
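
The batching idea in the second recommendation can be sketched as follows. This is not the MDBS code. It assumes, for illustration, that RETRIEVE COMMON returns the records of the first set whose value for a given attribute also occurs in the second set, and that `batch_size` stands for the number of records that fit in the locked pages. All names are hypothetical.

```python
from typing import Dict, Iterator, List

Record = Dict[str, object]


def batches(records: List[Record], batch_size: int) -> Iterator[List[Record]]:
    """Yield consecutive slices small enough to fit in the locked pages."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def retrieve_common(first: List[Record], second: List[Record],
                    attribute: str, batch_size: int) -> List[Record]:
    """Return records of `first` whose `attribute` value also occurs in `second`.

    Each batch of the first set is processed against the entire second set,
    and the second set is itself brought in one batch at a time.
    """
    result: List[Record] = []
    for first_batch in batches(first, batch_size):
        matched = set()  # positions in first_batch that have found a partner
        for second_batch in batches(second, batch_size):
            values = {rec[attribute] for rec in second_batch}
            for i, rec in enumerate(first_batch):
                if rec[attribute] in values:
                    matched.add(i)
        result.extend(rec for i, rec in enumerate(first_batch) if i in matched)
    return result
```

The result does not depend on the batch size. A smaller batch only increases the number of passes over the second record set.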

System process priority. The Get process is a very high- 
priority process that should be constantly listening over the 
communication net and anticipating messages, data, or trans¬ 
actions heading its way. It is akin to a real-time process. In 
other words, the CPU of the back end should never deacti¬ 
vate a Get. The operating system should provide a system 
function that enables us to give one of our
processes real-time priority.

DESPITE THE LIMITATIONS the prospects for a micro¬ 
processor-based, multiback-end database computer, such as 
MDBS, are good. These prospects are rooted in the following 
categories: external technology and intrinsic architecture. 

We note several technology prospects. 

1) All the hardware and software limitations discussed pre¬ 
viously can be overcome with present computer tech¬ 
nology. We do not have to wait for any distant technology 
for solutions. 








Project support 


The work reported here is now supported by funds from 
the Naval Research Laboratory and the Naval Postgraduate 
School. It began in 1980 with a proposal for an equipment 
grant submitted by the author to Digital Equipment Corp. 
In 1981 DEC and the Office of Naval Research funded the 
initial MDBS with a VAX 11/780 as the controller computer,
two PDP-11/44s and their disk drives as back ends,
and some PCLs as the communication network. The use of 
the 11/780 for the development of MDBS software was the 
main reason that we had also employed the VAX unit as 
the controller. This was before the advent of the micropro¬ 
cessor and the arrival of the microprocessor-based Ethernet. 
All three of our computers were minicomputers, with the 
VAX using the VMS operating system, and the PDPs, the
RSX operating system. The Ethernet functional specifica¬ 
tion had just been announced. We used three PCL cables 
to “emulate” the Ethernet specification. The two-back-end 
database computer was operated in the Laboratory for 
Database Systems Research at Ohio State University. 

The MDBS software requirements were largely the work 
of J. Menon. 1 D. Kerr, A. Orooji, and S. Demurjian accomplished
the principal supervision and development. 2, 3, 4

The process structure is essentially the same to this day, 
although we have updated the original code many times. 
We wrote all processes in C, with five of them running in
VMS and the other five in RSX. Get and Put in VMS commu¬ 
nicate with Put and Get in RSX. The Office of Naval Re¬ 
search supported our software development effort as well. 


In 1983, the Laboratory for Database Systems Research
moved to the Computer Science Department of the Naval 
Postgraduate School. We gave MDBS’s VAX 11/780 to Ohio 
State University and replaced it with one of the Computer
Science Department’s two VAX 11/780s. In 1985, we began
to phase out and replace all the minicomputers and
their communications network with microprocessor-based 
workstations and the Ethernet. We also began to add as 
many back ends as funds permitted. After we reached the 
round number of eight, we began to upgrade the micro¬ 
processors from 68010 to 68020 chips, and the real memo¬ 
ries in both size and speed. The present configuration 5 is 
depicted in Figure 1. We recently acquired two Sun-4 work¬ 
stations with RISC-based microprocessors. However, we 
have not decided whether we should replace all the eight 
back ends with Sun-4s or simply add these two to the
existing configuration for a heterogeneous, 10-back-end
configuration. The replacement and upgrade were funded 
by the Naval Postgraduate School equipment program and 
the Department of Defense STARS (Software Technology 
for Adaptable, Reliable Systems) program. 

MDBS at one time was known as MBDS. However, be¬ 
cause of a trademark conflict, we changed the name of our 
database computer to its present name. MDBS is used as a 
research vehicle for the study of design analyses 6-8 and
performance evaluations 9-11 and their methodologies. 12 We
also use it to study the database system architectures of the
future. 13-17


2) All the back-end hardware requirements for a database 
computer require no special-purpose computer compo¬ 
nents or devices. They can be met with off-the-shelf 
microprocessor-based hardware, such as Sun-4 work¬ 
stations or super PCs, each of which can support several 
hard disks of various capacities. 

3) The LAN hardware requirements for a database com¬ 
puter can also be met with an Ethernet-like LAN with 
wider bandwidths, higher transmission rates, and reli¬ 
able multibroadcast capabilities. Such LANs can be made 
available commercially. 

4) Microprocessor technology is progressing at a higher 
rate than that of both minicomputers and mainframes. 
We should be able to ride on the success of these faster 
advances. 

5) In storage technology, Winchester-type disks and disk
arrays based on that principle are improving more rapidly
than are the densities and capacities of standard
disk drives. By relying on the gains in Winchester-type
drive technology, we should have a better mix and choice
of disk drives with different capacities and numbers for
separate metadata, base data, and page stores.

We also note some architectural prospects. 

1) Parallel architecture uses identical back ends. Also, da¬ 
tabase back-end software and metadata are replicated. 
Only base data require redistribution. Thus, the cost of 
any additional hardware and software is minimized. It is 
rather straightforward to increase parallelism readily and 
cost-effectively. 

2) An increase in parallelism can be scaled for the purpose 
of achieving the desired response time for a given trans¬ 
action and/or a certain capacity of a database. 

3) The performance gain, whether a response-time reduction
or a capacity increase at an invariant response time, is
directly proportional to the number of back ends and
corresponding stores added.


December 1991 59 










The MDBS architecture is the most effective and efficient 
one in reducing the response time of a transaction. It is also 
the most effective and efficient architecture to maintain the 
same response time of a transaction, despite the increase in 
size of a database.
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This database processor accelerates nonindexed relational database queries. Rinda is com¬ 
posed of content search processors and relational operation accelerating processors; the former 
search rows stored in disk storage, and the latter sort rows stored in the main memory. The 
processors connect to a general-purpose host computer with channel interfaces. Performance 
improves substantially as Rinda executes queries three to 100 times faster than conventional 
database-management system software in a benchmark system. 
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Various database machine architectures
have been proposed and developed 
to solve the performance problems of 
relational databases. Database ma¬ 
chines fall into two classes on the basis of their 
functions: computers and processors. Database 
computers, such as the Britton Lee Intelligent Da¬ 
tabase Machine and the Teradata DBC/1012, are 
independent of their host computers and per- 
form all database-management functions by them¬ 
selves. Database processors, on the other hand, 
depend on their hosts and perform only certain 
database functions. For example, CAFS 1 is dedi¬ 
cated to ICL mainframes, and its functions are 
limited to searching and filtering table rows stored 
in disk storage. IDP 2 is an optional processor for 
Hitachi’s largest mainframe, and it can perform 
only sorting and merging of keys stored in the 
main memory. In general, database processors 
achieve higher performance at lower costs, be¬ 
cause specialization in their hardware is restricted 
to database functions that have caused severe 
performance problems in general-purpose 
computers. 

We can classify relational database access into 
queries and updates, which we can further clas¬ 
sify into nonindexed and indexed queries and 
updates. 3 Database systems execute indexed que¬ 
ries and updates efficiently using indexes created 


before the execution. A typical indexed query is a 
selection of a single row in a table by exact match¬ 
ing of a unique search key. With such queries or 
updates, current database-management systems 
perform fairly well. For nonindexed queries, how¬ 
ever, systems cannot take advantage of indexes. 
Examples of nonindexed queries are a selection 
of multiple rows by partial matching of a charac¬ 
ter string, a join of two tables with nonunique join 
keys, and an aggregation of a whole table with 
groups of rows in it. A general-purpose computer 
consumes a lot of processing time executing such 
queries because searching and sorting are 
computationally intensive operations. 4 

At NTT, we developed a database processor 
called Rinda to accelerate nonindexed relational 
database queries. 3 Rinda is composed of content 
search processors and relational operation accel¬ 
erating processors. 5 A content search processor, 
or CSP, searches rows in disk storage and trans¬ 
fers the selected rows to the main memory. A 
relational operation accelerating processor, or 
ROP, sorts the rows stored in the main memory. 
The CSP and ROP perform their work using hard¬ 
ware specialized for searching and sorting. Rinda 
is connected to a general-purpose host—our DIPS 
series mainframe 6 7 —with channel interfaces, and 
is controlled by a Rinda control program running 
on the host. 
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Relational database processor 


Rinda architecture 

In relational database systems, search¬ 
ing and sorting are basic operations. 

Searches select rows and columns from a table according to
predicates in a query and are the first step in query execution.
When a table is stored on disk, disk reads are also needed.
There are two problems with searches. For nonindexed queries,
they take a great amount of CPU time because all rows in a
table must be qualified. Transferring all rows from the disk
to the main memory also takes extra time. A solution to these
problems is an intelligent disk controller that performs
searches at the disk storage.

Sorts order rows in a table using a key, 
which may be composed of several col¬ 
umns. Systems can perform join, nested, 
and aggregate queries efficiently when 
rows are sorted. The major problem in 
sorting is CPU time consumption, because 
its complexity with ordinary algorithms 
is N*log(N), where N is the number of rows. Another problem
stems from memory storage because additional disk accesses
are required when memory size is insufficient. A
solution to these problems is a hardware sorter with a large 
memory, which sorts rows in time N. 

Design considerations. Our goal in developing Rinda 
was to relieve host computers from heavy loads caused by 
nonindexed queries. Therefore, Rinda had to satisfy the fol¬ 
lowing requirements: 

1) It should be a database processor, so the existing host 
computer and disk storage would not need to be re¬ 
placed. 

2) The application program interface should be SQL to avoid 
any modification of user programs. 



Figure 1. Rinda system organization. 


Table 1. Primary CSP functions.

Function              Description
Predicates            Specified in a WHERE clause
Comparison            <column> <comp-op> <value>
In                    <column> [NOT] IN <value list>
Like                  <column> [NOT] LIKE <pattern>
Null                  <column> IS [NOT] NULL
Boolean expression    Any combination of predicates
Column selection      SELECT <column list>
Set function          Count(*)


To meet the first requirement, Rinda uses the two special- 
purpose processors just introduced: CSPs and ROPs. A CSP 
connects to a host computer and a disk controller with chan¬ 
nel interfaces, forming an intelligent disk controller. An ROP 
that uses a hardware sorter also connects to the host by a 
channel. To meet the second requirement, we based Rinda’s 
functions on a subset of SQL functions. Software on the host 
computer fills the gaps between full-set SQL and Rinda. 

Hardware architecture. Figure 1 shows a typical Rinda 
system organization. The major components are a DIPS se¬ 
ries host computer, standard disk controllers and disk drives, 
CSPs, and ROPs. Database size and performance require¬ 
ments determine the number of CSPs and ROPs in a system. 
For example, if the database is very large, several CSPs used 


in parallel reduce the I/O time. The Rinda control program,
or RCP, runs on the host and controls the CSPs and ROPs
through the channel commands and order tables that it creates.

A CSP searches a database table stored on a disk, selects 
rows and columns specified by a query and transfers only the 
results to the host. Table 1 shows its primary functions, which 
cover most single-table queries of relational databases. The 
CSP performs these functions at the disk’s data-transfer rate. 

An ROP sorts rows in a table transferred from the main 
memory of the host and transfers the results back to the host. 
Its primary functions, listed in Table 2, include removal of 
unnecessary rows to accelerate join and nested queries. The 
ROP performs these functions at the data-transfer rate of the 
channel. We discuss CSP and ROP hardware structure in de¬ 
tail later. 
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Table 2. Primary ROP functions. 

Function             Description
Sorting              Order By <column[ASC/DESC] list>
                     (also used for joins, subqueries,
                     and Group-By clauses)
Filtering            Removal of unnecessary rows (used
                     for join and nested queries)
Duplicate removal    DISTINCT <column list>
Set function         Count(*) with a Group-By clause


Software architecture. Figure 2 shows the host computer’s 
software structure. The Rinda control program is attached to 
an existing relational database-management system to form a 
new integrated DBMS. The language-processing subsystem 
analyzes queries written as SQL statements. If the query is 
nonindexed, the system calls RCP functions, which optimize 
and execute queries using Rinda. 

Figure 3 illustrates a typical procedure to execute a 
nonindexed query. First, the RCP sends a command to the 
CSPs, which transfer selected rows to the RCP buffers in the 
main memory. Next, the RCP sends another command to an 
ROP, which sorts the rows in the buffers. Finally, the RCP 
returns the results to the user program. 
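
The three steps of this procedure can be mimicked in software to make the division of labor concrete. The sketch below is only a stand-in: real CSPs and ROPs are hardware units driven by channel commands, and the function names, the row representation, and the sample table are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def csp_search(table: List[Row], predicate: Callable[[Row], bool],
               columns: List[str]) -> List[Row]:
    """Stand-in for the CSP: keep qualifying rows and only the requested columns."""
    return [{c: row[c] for c in columns} for row in table if predicate(row)]


def rop_sort(rows: List[Row], key_columns: List[str]) -> List[Row]:
    """Stand-in for the ROP: sort the rows held in the RCP buffers."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in key_columns))


def rcp_execute(table: List[Row], predicate: Callable[[Row], bool],
                columns: List[str], key_columns: List[str]) -> List[Row]:
    """Stand-in for the RCP: search first, then sort, then return the result."""
    buffers = csp_search(table, predicate, columns)
    return rop_sort(buffers, key_columns)


# Hypothetical table and a partial-match (LIKE-style) nonindexed query.
parts = [
    {"id": 3, "name": "bolt", "qty": 5},
    {"id": 1, "name": "nut", "qty": 0},
    {"id": 2, "name": "boltcutter", "qty": 9},
]
print(rcp_execute(parts, lambda r: "bolt" in r["name"], ["id", "name"], ["id"]))
```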

The RCP also allows a table to be stored over multiple 
disks and conducts parallel CSP search¬ 
ing on the disks. The database control 
subsystem controls RCP query execution 
to keep the database consistent when 
nonindexed queries and indexed updates 
execute concurrently in multiuser envi¬ 
ronments. 

Related work. Several searching and sorting processors have
been developed. For example, CAFS 1 is a searching processor
with functions similar to those of
the CSP. However, the CSP has proper¬ 
ties different from those of other search¬ 
ing processors. First, its design is based 
on SQL functions, and it supports the null 
value. Second, it deals with the page for¬ 
mat of an existing DBMS so no data con¬ 
version is required to introduce Rinda to 
installed user database systems. As for 
sorting processors, existing hardware sort¬ 
ers 8 can sort only a small number of rows 
at one time because the number of pro¬ 
cessing elements is limited. The ROP 
solves this problem with multiway merg¬ 
ing. Moreover, we integrated sorting and 



Figure 2. Software structure on a host computer. 



Figure 3. Typical query execution using Rinda (AP indicates the application 
program). 
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Relational database processor 


Figure 4. CSP block diagram. 


filtering in the ROP to use the three-phase 
join method. 

Content search processor 

A database consists of database spaces, 
each of which is a collection of pages 
stored in one or more continuous disk 
areas. A database space may be distrib¬ 
uted over several disks. When a table is 
created in a database space, adjacent 
pages are assigned to the table. As the 
table grows, a constant addition of adja¬ 
cent pages increases the table. Thus, a table is stored in mul¬ 
tiple disk extents. Each page holds rows of a single table, and 
no row is stored across page boundaries. Each table contains 
control records indicating its disk extents in separate pages, 
and the RCP uses this information to perform CSP searches. 

CSP organization. A CSP search can be performed in two 
ways, synchronously and asynchronously. In the synchronous 
mode, a data stream from a disk is processed on the fly, while in
the asynchronous mode, buffers decouple reading and search¬ 
ing. The CSP performs only asynchronous searches for the 
following reasons: 

• A disk controller has an error-detecting and -correcting
facility for data pages. This facility finishes its work only
after the last byte of the page has been transferred from the
disk to the buffer.

• We had designed the page format without any consider¬ 
ation of hardware-searching mechanisms. Changing the 
page format from software-oriented to hardware-oriented 
was impossible because it would require database con¬ 
version for existing systems. 



The CSP transfers pages successively from a disk to its buffers
using the multitrack-read method and scans the page in
the buffer before the next page comes. Therefore, the CSP
performs a search within the page-transfer time, similar to
doing it on the fly. Figure 4 shows a block diagram of the CSP.

Search operation flow. Rinda systems perform searches
as follows:

1) The RCP running on the host acquires the RCP buffers
and prepares a set of channel commands and an order
table for each CSP. An order table contains disk access
information that includes the disk unit number, extent
addresses to be searched, the RCP buffer size, and a
search condition.

2) Channel commands send the order tables from the host
to the CSPs.

3) According to the order table, each CSP generates new
channel commands to the disk to read pages successively
using the multitrack-read method. When the first page
arrives, the CSP begins to select rows from the page. To
keep up the data-transfer rate, the CSP uses multiple input
buffers that hold pages transferred from the disk, and
multiple output buffers that hold new pages containing
selected rows.

4) When one of the output buffers fills, the CSP transfers the
page to the host asynchronously.

This procedure continues without any interruption to the
host until the CSP has read all the extents or until all the RCP
buffers are occupied. In the latter case, the RCP saves the
selected rows on a work disk and repeats the search
procedure.

Evaluation of search conditions. A search condition is
expressed by predicates and Boolean operators And, Or, and
Not. In SQL, when the value is undefined, each column may
have a null value instead of zero or space. As a result, the
truth value of a predicate in the condition may be true, false,
or unknown. Therefore, the three-valued logic (true, false,
and unknown) shown in Figure 5, instead of two-valued logic
(true and false), defines Boolean operators.


And        True       False      Unknown
True       True       False      Unknown
False      False      False      False
Unknown    Unknown    False      Unknown

Or         True       False      Unknown
True       True       True       True
False      True       False      Unknown
Unknown    True       Unknown    Unknown

Not        True       False      Unknown
           False      True       Unknown


Figure 5. Truth tables of the three-valued logic. 
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If unknown is replaced by false in the evaluation of predi¬ 
cates, the three-valued logic is localized into two-valued logic, 
and the CSP can be implemented simply. A WHERE clause in an SQL
statement, WHERE <search condition>, selects rows whose
final truth value of the <search condition> is true. Therefore, 
unknown can be replaced by false in the last stage of the 
evaluation. In the truth table shown in Figure 5, the 
Not(unknown) column is important. If unknown is simply 
replaced by false, the result may be unreasonable because 
Not(false) is true while Not(unknown) is unknown, which finally counts as false. Therefore, if
there is no Not operator in the search condition, unknown 
can be replaced by false, and the three-valued logic can be 
localized into two-valued logic. 
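
The argument can be checked with a small model of the truth tables in Figure 5. In this sketch, Python's `None` stands for unknown. The function names are illustrative and are not part of Rinda.

```python
from typing import Optional

TV = Optional[bool]  # True, False, or None for "unknown"


def and3(a: TV, b: TV) -> TV:
    """Three-valued And: false dominates, then unknown."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a: TV, b: TV) -> TV:
    """Three-valued Or: true dominates, then unknown."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a: TV) -> TV:
    """Three-valued Not: unknown stays unknown."""
    return None if a is None else (not a)


def selected(final: TV) -> bool:
    """A WHERE clause keeps a row only when the final truth value is true."""
    return final is True


def early(a: TV) -> bool:
    """Replace unknown by false at the predicate level (two-valued shortcut)."""
    return a is True


# With Not present, early replacement changes the outcome for an unknown predicate.
print(selected(not3(None)), not early(None))
```

For conditions built only from And and Or, `selected` applied to the three-valued result agrees with ordinary two-valued evaluation of the `early` values. The printed pair shows the disagreement that a Not operator introduces.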

The following procedure transforms any search condition 
to another search condition having the same effect without 
Not operators: 

1) Transform a search condition to a conjunctive canonical 
form. 

2) If there is no predicate with Not, the procedure ends. 

3) If there is a predicate with Not, remove Not in the predicate
and reverse the comparison operator in the predicate.
For example, Not(column1 > x) is replaced by
(column1 <= x).

In Rinda systems, the RCP transforms a search condition to
a conjunctive canonical form without Not operators, and
creates an order table for the CSP. The CSP replaces unknown
with false in the evaluation of predicates and applies
two-valued logic.
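
As a rough illustration of the Not-removal idea, the sketch below rewrites a search condition so that no Not operator remains. It is not the RCP's procedure: instead of first building a conjunctive canonical form, it pushes Not inward with De Morgan's laws and then complements the comparison operator, which is a simpler stand-in with the same effect on the selected rows. The tuple encoding of conditions, the function names, and the use of None for a null column value are all assumptions made for this example.

```python
import operator

# Complement of each comparison operator (used when a Not is absorbed).
COMPLEMENT = {">": "<=", "<=": ">", "<": ">=", ">=": "<", "=": "<>", "<>": "="}
OPS = {">": operator.gt, "<=": operator.le, "<": operator.lt,
       ">=": operator.ge, "=": operator.eq, "<>": operator.ne}


def remove_not(cond, negate=False):
    """Return an equivalent condition that contains no 'not' node."""
    kind = cond[0]
    if kind == "not":
        return remove_not(cond[1], not negate)
    if kind in ("and", "or"):
        if negate:  # De Morgan: Not(a And b) = Not(a) Or Not(b), and vice versa
            kind = "or" if kind == "and" else "and"
        return (kind, remove_not(cond[1], negate), remove_not(cond[2], negate))
    _, column, op, value = cond
    return ("cmp", column, COMPLEMENT[op] if negate else op, value)


def eval3(cond, row):
    """Three-valued evaluation: True, False, or None for unknown."""
    kind = cond[0]
    if kind == "not":
        inner = eval3(cond[1], row)
        return None if inner is None else not inner
    if kind in ("and", "or"):
        a, b = eval3(cond[1], row), eval3(cond[2], row)
        decisive = kind == "or"      # True decides Or, False decides And
        if a is decisive or b is decisive:
            return decisive
        return None if a is None or b is None else not decisive
    _, column, op, value = cond
    if row[column] is None:          # comparison with null is unknown
        return None
    return OPS[op](row[column], value)


def eval2(cond, row):
    """Two-valued evaluation of a Not-free condition (unknown -> false)."""
    kind = cond[0]
    if kind == "and":
        return eval2(cond[1], row) and eval2(cond[2], row)
    if kind == "or":
        return eval2(cond[1], row) or eval2(cond[2], row)
    _, column, op, value = cond
    return row[column] is not None and OPS[op](row[column], value)


# Not(a > 5 Or Not(b = 1)) becomes (a <= 5 And b = 1).
original = ("not", ("or", ("cmp", "a", ">", 5), ("not", ("cmp", "b", "=", 1))))
rewritten = remove_not(original)
rows = [{"a": a, "b": b} for a in (None, 3, 5, 9) for b in (None, 1, 2)]
same_rows = all((eval3(original, r) is True) == eval2(rewritten, r) for r in rows)
```

For every row, including those with null columns, the rewritten condition evaluated in two-valued logic selects the same rows as the original condition evaluated in three-valued logic.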

Accelerating processors 

The purpose of the ROP is to accelerate join queries. We
developed a new join algorithm, which makes the best use of
a hardware sorter.

Join algorithms. There are three principal types of
conventional join algorithms: nested-loop, 9 sort-merge, 9
and hash 10 algorithms. Nested-loop join algorithms
repeatedly compare each row in the outer table with all rows
in the inner table. This algorithm is practical only if both
tables are small, because it requires a very large number of
comparisons. Hash join algorithms split both source tables
into several buckets using a hashing function and execute
comparisons within each bucket, in which every row has the
same hashing value. Therefore, hash join algorithms are
suitable for parallel processors. Sort-merge join algorithms
sort both tables so that the merge-join computation is
reduced to linear order. We selected a sort-merge join
algorithm for Rinda because specialized hardware can rapidly
execute sorting.
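
To make the contrast concrete, here is a minimal Python sketch of a nested-loop join and a sort-merge join on the first field of each row. It is a textbook illustration under assumed row and key formats, not Rinda's implementation.

```python
def nested_loop_join(outer, inner):
    """Compare every outer row with every inner row: O(n * m) comparisons."""
    return [(a, b) for a in outer for b in inner if a[0] == b[0]]


def sort_merge_join(left, right):
    """Sort both inputs on the key, then join them in a single merge pass."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        key_left, key_right = left[i][0], right[j][0]
        if key_left < key_right:
            i += 1
        elif key_left > key_right:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == key_left:
                j_end += 1
            # Pair every left row carrying this key with that run.
            while i < len(left) and left[i][0] == key_left:
                result.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return result


table_a = [(3, "a3"), (1, "a1"), (2, "a2"), (2, "a2bis")]
table_b = [(2, "b2"), (4, "b4"), (3, "b3"), (2, "b2bis")]
joined = sort_merge_join(table_a, table_b)
```

After sorting, each input is scanned once, apart from the rescans needed for runs of duplicate keys, which is why the merge step costs so little compared with the sort.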

Rinda’s three-phase join method 4,11 based on the sort-merge
algorithm consists of filtering, sorting, and merge-join
phases. Most unnecessary rows are removed in the filtering
phase. Remaining rows, each of which is a candidate for the
join, are sorted in the sorting phase. After this, sorted
rows are merged together in the merge-join phase.

The filtering phase uses hashed bit arrays 12 to remove
unnecessary rows. Of the three phases, the filtering phase
involves the most rows. Therefore, we decided that
specialized hardware should execute this step. Sorting
consumes much CPU time, so we chose specialized hardware to
execute the sorting step. Filtering decreases the number of
rows, and the remaining rows are sorted during the sorting
phase. Therefore, the RCP on the host executes the merge-join
phase with little computation.
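
The three phases can be sketched in software as follows. This is only an illustration: Python's built-in hash stands in for the hardware hasher, sorted stands in for the hardware sorter, and the bit array is built from one table and applied to the other, which is an assumption about the filtering direction. All function names are hypothetical.

```python
def build_bit_array(keys, size):
    """Hashed bit array: one flag per hash address (a byte each, for clarity)."""
    bits = bytearray(size)
    for key in keys:
        bits[hash(key) % size] = 1
    return bits


def filter_rows(rows, bits):
    """Keep only rows whose key hashes to a set bit (join candidates)."""
    return [row for row in rows if bits[hash(row[0]) % len(bits)]]


def merge_join(a_sorted, b_sorted):
    """Join two inputs already sorted on their first field."""
    out, j = [], 0
    for a in a_sorted:
        while j < len(b_sorted) and b_sorted[j][0] < a[0]:
            j += 1
        m = j
        while m < len(b_sorted) and b_sorted[m][0] == a[0]:
            out.append((a, b_sorted[m]))
            m += 1
    return out


def three_phase_join(a_rows, b_rows, size=1024):
    # Phase 1: filtering. Collisions may let a few useless rows through,
    # but no row that really joins is ever dropped.
    b_candidates = filter_rows(b_rows, build_bit_array((r[0] for r in a_rows), size))
    # Phase 2: sorting (the hardware sorter's job in Rinda).
    a_sorted = sorted(a_rows, key=lambda r: r[0])
    b_sorted = sorted(b_candidates, key=lambda r: r[0])
    # Phase 3: merge join (left to the host in Rinda).
    return merge_join(a_sorted, b_sorted)


table_a = [(k, "a") for k in range(0, 100, 2)]   # even keys
table_b = [(k, "b") for k in range(0, 100, 3)]   # multiples of three
joined = three_phase_join(table_a, table_b)
```

A smaller bit array raises the loading factor, so more non-matching rows survive phase 1; the join result stays correct, but phases 2 and 3 have more work to do.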

ROP organization. Figure 6 shows a block diagram of the 
ROP, which performs filtering and sorting. Its main functional 
blocks are the hasher and sorter. The key extractor and row 
transfer blocks were implemented for a row-level pipeline. 

Rows are generally composed of several columns with
different data types such as integer, decimal, and character
string. A null value may also appear in a column. The ROP has
key-extractor hardware to compose a fixed-length and directly
comparable internal key from a row. Using the internal keys,
simple hardware can rapidly execute filtering and sorting.
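
The internal key format is not described here, so the following Python sketch only illustrates the general idea of a fixed-length, directly comparable key. Every encoding choice in it is an assumption: a one-byte null flag that sorts nulls first, 32-bit integers stored big-endian with the sign bit flipped, and ASCII strings truncated or blank-padded to a fixed width.

```python
import struct

NULL_FLAG = b"\x00"    # assumption: nulls sort before all real values
VALUE_FLAG = b"\x01"


def encode_int32(value):
    """Signed 32-bit integer -> 5 bytes whose byte order matches numeric order."""
    if value is None:
        return NULL_FLAG + b"\x00" * 4
    # Adding 2**31 flips the sign bit, mapping [-2**31, 2**31) onto [0, 2**32).
    return VALUE_FLAG + struct.pack(">I", value + 2**31)


def encode_char(value, width):
    """ASCII string -> 1 + width bytes, truncated or padded with blanks."""
    if value is None:
        return NULL_FLAG + b" " * width
    raw = value.encode("ascii")[:width]
    return VALUE_FLAG + raw.ljust(width, b" ")


def internal_key(row, spec):
    """Concatenate the encoded columns; spec holds ("int",) or ("char", width)."""
    parts = []
    for value, column in zip(row, spec):
        if column[0] == "int":
            parts.append(encode_int32(value))
        else:
            parts.append(encode_char(value, column[1]))
    return b"".join(parts)


spec = [("int",), ("char", 4)]
rows = [(5, "bob"), (-3, "al"), (5, "al"), (None, "zed")]
ordered = sorted(rows, key=lambda row: internal_key(row, spec))
```

Because every key has the same length and compares correctly byte by byte, a comparator needs no knowledge of column types. Strings longer than the chosen width compare only on their prefix in this sketch.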

The ROP has a single memory for storing internal keys,
rows, and hashed bit arrays. The ROP moves the boundaries
between the key and row areas dynamically when it stores
keys and rows. The size of the bit array is a constant ratio
of the storage capacity, which keeps the loading factor low.
For example, if storage capacity increases, so does the size
of the bit arrays. Keys and rows are stored in the remaining
ROP memory.

Figure 6. ROP block diagram.

Figure 7. Multiplication-folding function.

Filtering block. Since unnecessary rows survive the filter
whenever hashing-function values collide, we needed a
sophisticated hashing function to decrease the number of
collisions. An ideal hashing function would distribute all
keys uniformly over the addressing space of a hashed bit
array. Lum, Yuen, and Dodd 13 and Knott 14 compared division,
multiplication, folding, and other hashing functions on the
basis of the number of collisions, using fixed-length short
keys. The division method produced good results, with fewer
collisions for an unknown set of keys. However, in the
filtering phase, the division method cannot be applied
directly because internal keys may be too long. We determined
that Rinda’s hashing function should

• operate well when the loading factor (the number of
hashed keys divided by the size of the bit array) is rather
small;

• decrease the number of collisions (but it need not handle 
overflow when a collision occurs); and 

• use the same hardware mechanism to hash any keys 
having various data types and lengths. 

For hashing long keys, the folding method works well. It
divides keys into several short fragments and folds them
using the Exclusive-Or (XOR) operator. However, in character
strings, especially in Japanese 16-bit kanji strings, the
probability of a 1 occurring is not the same for every bit
position. Thus, hashed results are biased if the simple
folding method with 1-byte or 2-byte fragments is used.
Shifting or reversing the bit order of the fragments solves
this problem.

We developed a new multiplication-folding function 11 for
Rinda: a folding method with bit shifting and multiplication.
The multiplication randomizes the character code, and the
folding handles long keys. Compact specialized hardware with
XOR and simple multiplication circuits implements the method.

Figure 7 is a schematic diagram of the multiplication-folding
function. Keys are divided into several short segments. The
first segment is multiplied by a prime number and rotated
with wraparound to distribute the value. Then, the second
segment is multiplied by the same prime number and folded on
the first segment with XOR. This procedure repeats until all
segments are folded. This multiplication-folding function
yields a high filtering factor for any type and length of
keys.
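
A software sketch of this scheme might look as follows. The segment width, the prime 65521, the 20-bit result width, and the rotation distance are all assumed values chosen for illustration; the parameters of the actual hardware are not given here. A plain XOR fold is included for contrast.

```python
PRIME = 65521            # assumed multiplier (a prime below 2**16)
WIDTH = 20               # assumed result width in bits
MASK = (1 << WIDTH) - 1
ROTATE = 5               # assumed rotation distance


def rotate_left(value, count):
    """Rotate a WIDTH-bit value left; bits leaving the top re-enter at the bottom."""
    return ((value << count) | (value >> (WIDTH - count))) & MASK


def multiplication_folding(key, segment_bytes=2):
    """Multiply each segment by PRIME, rotate the running value, fold with XOR."""
    folded = 0
    for start in range(0, len(key), segment_bytes):
        segment = int.from_bytes(key[start:start + segment_bytes], "big")
        folded = rotate_left(folded, ROTATE) ^ ((segment * PRIME) & MASK)
    return folded


def simple_folding(key, segment_bytes=2):
    """Plain XOR folding, shown only to illustrate its weakness."""
    folded = 0
    for start in range(0, len(key), segment_bytes):
        folded ^= int.from_bytes(key[start:start + segment_bytes], "big")
    return folded
```

Plain folding gives b"abcd" and b"cdab" the same value because XOR ignores segment order, whereas the rotation between folds makes the result depend on where each segment sits in the key. The same routine accepts keys of any length, matching the requirement that one mechanism hash all data types.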

Sorting block. Many sorters can increase sorting speed
using parallel and pipelined hardware. 8 However, such
hardware sorters are too large for database processors
because their hardware size depends on the number of sorted
keys. We decided that Rinda’s sorter should

• handle a variable number of sorting keys at query 
execution time, 

• achieve a high sorting speed regardless of the number 
of keys, 

• handle any length and type of key efficiently because 
keys may be long and composed of several data types, 

• use compact hardware, and 

• permit easy expansion of the allowable number of keys. 

The multiway merge sorter 15 using a sorting array 16 meets
these requirements. As Figure 8 shows, the sorter implemented
in the ROP consists of a sorting array, a merge controller,
and working storage, which is the key area of ROP memory. The
sorting array is composed of cascade-connected sorting
elements with a one-dimensional linear array structure. Each
element consists simply of two memories and a comparator as
main circuits. The sorting array executes
comparison-transferral actions synchronously in a pipelined,
parallel manner. Each sorting element compares two keys and
selects and transfers the smaller or larger one to the
neighboring element through a dedicated data path controlled
by the previous comparison result.

The sorter can sort a large number of keys in a series
of k-way merge operations. In each k-way merging stage, k
strings in the working storage are merged at once and then
stored again as a single string. The size of this string is
the total number of keys in all k strings.
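
The staged merging can be sketched in a few lines of Python. Here heapq.merge from the standard library stands in for the hardware sorting array, and starting from one-key strings is an assumption made to keep the example short.

```python
import heapq


def kway_merge_sort(keys, k):
    """Sort by repeated k-way merging; return the sorted keys and stage count."""
    if k < 2:
        raise ValueError("k must be at least 2")
    strings = [[key] for key in keys]      # every key starts as its own string
    stages = 0
    while len(strings) > 1:
        # One merging stage: each group of k strings becomes a single string
        # whose length is the total number of keys in the group.
        strings = [
            list(heapq.merge(*strings[i:i + k]))
            for i in range(0, len(strings), k)
        ]
        stages += 1
    return (strings[0] if strings else []), stages


keys = [(37 * i) % 101 for i in range(100)]
sorted_keys, stages = kway_merge_sort(keys, k=4)
```

With k = 4, the 100 one-key strings shrink to 25, 7, 2, and finally 1 string, so four stages suffice. A larger k reduces the number of passes over working storage, while the merging hardware only ever has to hold k candidate keys.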

Figure 8. Multiway merge sorter.

Using a new data-driven string-selection technique, the
sorter performs a multiway merge operation. Keys, read from
the working storage, are identified with bank tags. The
sorting array compares them and outputs them with their bank
tags.

Figure 9. Continuous multiway merging using bank tags.

Figure 9 shows a continuous key flow using the bank tag. The
figure shows an example of four-way merging in descending
order. In the M1 phase, the sorting array is filled with the
largest keys in all source strings. These largest keys are
located at the top of the strings. In the M2 phase, merging
proceeds successively. The largest key in the sorting array
is immediately output with the bank tag. The second-largest
candidate key is limited to any of the remaining keys in the
sorting array or to the top key in the string indicated by
the output key’s bank tag. Therefore, this top key is input
to the array and is compared with the remaining candidate
keys. The sorter performs successive multiway merging using
these alternate key output-input operations until all source
strings are emptied.
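
The bank-tag mechanism can be mimicked in software. In the Python sketch below, a small heap plays the role of the sorting array, each key travels with the number of the string (bank) it came from, and the tag of every output key selects the string that supplies the next input. Numeric keys and the function name are assumptions made for the example.

```python
import heapq


def merge_with_bank_tags(strings):
    """Merge strings that are each sorted in descending order.

    Returns (key, bank) pairs in descending key order.
    """
    next_index = [0] * len(strings)
    array = []                       # stand-in for the sorting array

    # M1 phase: load the top (largest) key of every source string.
    for bank, string in enumerate(strings):
        if string:
            heapq.heappush(array, (-string[0], bank))   # negate for max-first
            next_index[bank] = 1

    merged = []
    # M2 phase: output the largest key, then refill from the tagged string.
    while array:
        negated_key, bank = heapq.heappop(array)
        merged.append((-negated_key, bank))
        position = next_index[bank]
        if position < len(strings[bank]):
            heapq.heappush(array, (-strings[bank][position], bank))
            next_index[bank] = position + 1
    return merged


source_strings = [[9, 4, 1], [8, 7], [6, 5, 2], [3]]
merged = merge_with_bank_tags(source_strings)
```

The array never holds more than one key per source string, so its size depends only on k and not on the number of keys being sorted.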

Performance 

We evaluated Rinda’s performance using the extended
Wisconsin Benchmark. 17,18 The database table consisted of
100,000 rows, and each row was 208 bytes long. The Rinda
system consisted of the NTT DIPS-V30E superminicomputer as
the host, two CSPs, and an ROP. The database resided on two
1.3-Gbyte disks whose data-transfer rate was 3 Mbytes per
second.

A user program on the host computer measured the elapsed
time from the start of query execution to the return of the
last result row to the user. To measure the performance with
a real application, we transferred all the result rows to the
user program one by one using fetch operations. We classified
the queries executed into four categories: simple selections,
joins, Min functions, and Count functions, as listed in
Table 3. The Count functions are additional queries to
evaluate Rinda’s pure hardware performance.

Results. Figure 10 shows the performance improvement
achieved with Rinda. The speedup is the elapsed time without
Rinda divided by the elapsed time with Rinda. It is shown
with the CPU and I/O times separated. The I/O time includes
the processing time of the CSPs and the ROP and the time
required to access the work disk. The system executes the
queries for 1-percent and 10-percent selections and scalar
Min and Count with only the CSPs but performs the others
with a combination of the CSPs and the ROP. Table 4 lists the
query execution time in seconds with Rinda as well as the
times of some other database machines measured by DeWitt
et al. 18 The query execution times do not include fetch
operations but do include storing the result rows on the work
disk. The Teradata machine consisted of 24 processors, while
the Gamma machine had 17 processors.

Considerations. Figure 10 shows that the speedup of
queries with CSPs depends on the number of result rows. In
general, the larger the result, the smaller the speedup. For
example, the speedup is more than 30 times in the scalar
Count, in which no rows are transferred, while it is less
than four times in the 10-percent selection. The reason is
that the RCP requires more CPU time to transfer rows from the
result table to the user program. In other words, the fetch
operations take a long time when the number of result rows is
large, because the host computer must perform them serially.
The speedup of the scalar Min is less than that of the scalar
Count, because the Count is directly supported by the CSPs
while the Min is not. For the Min, the values of the
specified column in all rows are transferred from the CSPs to
the RCP, which determines the minimum value.

Table 3. SQL statements executed.

Query      Type        SQL statement

Selection  1 percent   SELECT * FROM HUNKA WHERE
                       UNIQUE1 >= 50000 AND
                       UNIQUE1 < 51000

           10 percent  SELECT * FROM HUNKA WHERE
                       UNIQUE1 >= 50000 AND
                       UNIQUE1 < 60000

Join       AselB       SELECT * FROM HUNKA A,
                       HUNKB B WHERE
                       A.UNIQUE1 = B.UNIQUE1 AND
                       A.UNIQUE1 < 10000

           CselAselB   SELECT * FROM HUNKA A,
                       HUNKB B, TENKA C WHERE
                       C.UNIQUE1 = A.UNIQUE1 AND
                       A.UNIQUE1 = B.UNIQUE1 AND
                       A.UNIQUE1 < 10000 AND
                       B.UNIQUE1 < 10000

Min        Scalar      SELECT MIN(UNIQUE1) FROM
                       HUNKA

           Group-By    SELECT MIN(TWOTHOUS) FROM
                       HUNKA GROUP BY HUNDRED

Count      Scalar      SELECT COUNT(*) FROM
                       HUNKA

           Group-By    SELECT COUNT(*) FROM
                       HUNKA GROUP BY HUNDRED

Comparison of the scalar and Group-By in the Min or Count 
functions in Figure 10 shows the speedup achieved with an 
ROP. The additional sorting load does not reduce the speedup, 
because the ROP sorts dozens of times faster than the host 
computer. The join queries also demonstrate the ROP’s effect 
in contrast with the 10-percent selection, which does not use 
the ROP. 

Table 4 shows that Rinda is faster in the selections and
slower in the joins and Min functions than the Teradata and
Gamma machines. The times differ because the host computer
has almost nothing to do for the selections, while it must
perform the merge-join phase for the joins and value
comparisons for the Min function. These loads are somewhat
smaller than those of the original queries. However, they
still require relatively long processing times in single
small-scale general-purpose processors.


Table 4. Query execution times in seconds.

Query      Type        Rinda   Teradata   Gamma

Selection  1 percent     5.2       18.6    13.4
           10 percent    9.0       14.9    12.7

Join       AselB       136.8      235.6    35.8
           CselAselB   128.5       95.7    37.9

Min        Scalar       31.7       18.3    15.5
           Group-By     47.3       27.1    19.4


Figure 10. Speedup with Rinda (left) and without Rinda (right).



Merits and limitations 

Rinda accelerates searching and sorting 20 to 50 times, and 
its total speedup of queries is in the range of three to 100 
times. Rinda is generally very efficient when the source table 
is large and the result table is small. 

However, Rinda has limitations, the first of which is
concurrent execution with updates. To guarantee transaction
serializability, the RCP (Rinda control program) must lock
the whole table before CSP searching, resulting in a locking
overhead with nonindexed queries and time delays with indexed
updates. Using a dirty read facility that requires no locks
for queries executed by Rinda solves this problem. The second
limitation involves multiple query execution. The RCP
executes multiple queries concurrently, while each CSP and
ROP accepts only one request at a time. Therefore,
simultaneous execution of two queries that require the same
CSP or ROP increases response time.


Rinda relieves its host computer of the heavy loads caused
by nonindexed queries. The CSP, a special-purpose processor
with hardware for searching, selects rows from a disk at the
disk’s data-transfer rate. The ROP, another processor with
specialized hardware, sorts rows selected by the CSPs. For
join queries, the ROP uses the three-phase join method with a
hardware filter and sorter. Rinda substantially reduces the
host computer’s CPU and I/O times. Our performance study
shows that Rinda accelerates nonindexed queries from three to
100 times, compared with conventional DBMS software.

Rinda operates with several business database systems. The
businesses use Rinda mainly to retrieve rows by ad hoc
queries or to get statistical reports from very large tables.
Rinda’s role is increasing as database applications become
more advanced.
Ulrich, + , M-M Jun 91 22-25, 88-94

Artificial intelligence; cf. Cognitive science 
Associative memories 

associative caches and associative circuits as accelerators for large 
databases. Faudemay, Pascal, + , M-M Dec 91 22-34 


B 

Bibliographies 

3-D optical architecture. Louri, Ahmed, M-M Apr 91 24-27, 65-82 

Biomedical signal analysis 

myoelectric signal analysis using computer systems and signal 
processing techniques. Knaflitz, Marco, +, M-M Oct 91 12-15,48-58 

Book reviews 

Computer Dictionary. Mateosian, Richard, M-M Aug 91 43 


Intel’s Official Guide to 386 Computing (Edelhart, M.; 1991). 

Mateosian, Richard, M-M Jun 91 50-51 
PostScript Language Reference Manual, 2nd edn. (Systems, A.; 1990).

Mateosian, Richard, M-M Apr 91 28-29 
Superscalar Microprocessor Design (Johnson, M.). Mateosian, 
Richard, M-M Jun 91 51, 102 

The Creative Mind—Myths and Mechanisms (Boden, M. A.; 1990). 

Mateosian, Richard, M-M Oct 91 4-6 
The Emperor’s New Mind (Penrose, R.; 1991). Mateosian, Richard,
M-M Aug 91 42-43
Brain; cf. Cognitive science 
Buffer memories; cf. Cache memories 
Bulk memories; cf. Mass memories 
Business; cf. Computer industry 


C 


Cable shielding; cf. Wire communication cable shielding 
Cache memories 

associative caches and associative circuits as accelerators for large 
databases. Faudemay, Pascal, + , M-M Dec 91 22-34 
RST cache memory design for tightly coupled Clipper-based 
multiprocessor system. Prete, Cosimo A., M-M Apr 91 16-19, 40-52

Cellular logic 

Datawave single-chip multiprocessor for video applications. Schmidt, 
Ulrich, + , M-M Jun 91 22-25, 88-94 

Circuit simulation 

KMDS expert system for integrated hardware/software design of 
microprocessor-based digital systems. Kuo, Yau-Hwang, + , M-M 
Aug 91 32-35, 86-92 

Cognitive science 

book review; The Creative Mind—Myths and Mechanisms (Boden, 
M. A.; 1990). Mateosian, Richard, M-M Oct 91 4-6 
book review; The Emperor’s New Mind (Penrose, R.; 1991). 
Mateosian, Richard, M-M Aug 91 42-43 
Communication switching; cf. Message switching; Packet switching 
Communication system performance; cf. Computer network 
performance 
Compilers 

Metaflow architecture using deferred scheduling and 
register-renaming instruction shelf to manage out-of-order 
execution. Popescu, Val, + , M-M Jun 91 10-13, 63-73 

Computer applications 

project management, site, track, and mobile computers used in 
construction of England-to-France tunnel (Micro World). Kirrmann,
Hubert, M-M Jun 91 4-6 

Computer architecture 

3-D optical architecture. Louri, Ahmed, M-M Apr 91 24-27, 65-82 
evolving architectures for microprocessors; overview. Slater, Michael, 
M-M Feb 91 96-95 

IBM RISC System/6000 architecture and performance; benchmark 
details. Oehler, Richard R., + , M-M Jun 91 14-17, 56-62 
KRPP and DSNS superscalar architectures using task and instruction 
level parallelism. Fukuda, Akira, + , M-M Aug 91 16-19, 50-61 
Metaflow architecture using deferred scheduling and 
register-renaming instruction shelf to manage out-of-order 
execution. Popescu, Val, + , M-M Jun 91 10-13, 63-73 
redesign of Clipper C400 RISC architecture. Sachs, Howard G., + , 
M-M Jun 91 18-21,74-80 
Computer buses; cf. Data buses 
Computer control; cf. Digital control 

Computer engineering education; cf. Computer science education 
Computer graphics; cf. Visual languages; Visualization 
Computer graphics languages; cf. Visual languages 
Computer industry 

growth of computer industry in Taiwan (Software Report). Kahaner, 
David K., M-M Jun 91 39-41 
Computer interfaces

IEEE P1175/D11 draft standard on computing system tool
interconnections overview (Micro Standards). Warren, Carl, M-M 
Dec 91 83-85 

Lotus Development Corp. vs. Paperback Software International; 
court’s failure to recognize functionality of standardization (Micro 
Law). Stern, Richard H., M-M Feb 91 48-51
real-time computing with IEEE Futurebus+. Sha, Lui, + , M-M Jun 91 
30-33, 95-100 

Computer interfaces; cf. Measurement-system data handling; 

Microcomputer interfaces 
Computer language processors; cf. Compilers 
Computer languages 

Alphorn remote procedure call environment for fault-tolerant,
heterogeneous, distributed systems. Aschmann, Hans-Ruedi, + , 
M-M Oct 91 16-19,60-67 

book review; PostScript Language Reference Manual, 2nd edn.
(Systems, A.; 1990). Mateosian, Richard, M-M Apr 91 28-29 
Computer languages; cf. Visual languages 
Computer network performance 

interprocessor communication performance evaluation of five types of 
hypercube and grid-topology multicomputers. Zhang, Xiaodong,
M-M Apr 91 12-15, 52-55t 

Computer networks; cf. Local area networks; Multiprocessing 
Computer peripherals; cf. Microcomputer peripherals 
Computer pipeline processing; cf. Pipeline processing 
Computer science 

book review; Computer Dictionary. Mateosian, Richard, M-M Aug 91 
43 

Computer science education 

adapting curriculum materials for different sequences of 
microprocessing courses. Hall, Douglas V., M-M Feb 91 34-37, 
82-83 

advanced educational microprocessor system used as classroom 
demonstration tool and laboratory instrument. Pollard, L. Howard, 
+ , M-M Feb 91 22-25, 78-79 

microcomputer-interfacing laboratory projects. Fulcher, John A., M-M 
Feb 91 18-21,75-78 

microprocessor education curriculum incorporating logidules, CALM 
language, and Dauphin and Smaky microprocessor systems. Nicoud, 
Jean-Daniel, M-M Feb 91 14-17, 62-68 
microprocessors in education (special issue). M-M Feb 91 14-83 
teaching methods for peripheral hardware class and hands-on 
multitasking lab. Schultz, Thomas W., M-M Feb 91 30-33, 80-82 
Computer vision; cf. Machine vision 

Computers; cf. Database machines; Distributed computing; 
Microcomputers; Optical computing; Parallel processing; Virtual 
computers 

Content-addressable memories; cf. Associative memories 

Control engineering education 

configurable, virtual microprocessor system to simulate and validate 
process plant designs in classroom. Russell, David W., + , M-M Feb 
91 26-29 

Control systems; cf. Digital control 

Copyright protection 

answering questions on copyright office forms correctly (Micro Law). 

Stern, Richard H., M-M Oct 91 33-34 
database system copyrights (Micro Law). Stern, Richard H., M-M Dec 
91 77-79 

Copyright protection; cf. Software protection 

Custom integrated circuits; cf. Application-specific integrated circuits 


D 

Data acquisition; cf. Measurement-system data handling 

Data buses 

IEEE P996 draft standard for AT-bus; development (Micro Standards). 
Warren, Carl, M-M Feb 91 45, 56 


Data communication; cf. Distributed computing; Local area networks; 

Measurement-system data handling; Message switching; 

Multiprocessing, interconnection 
Data processing; cf. Database systems 

Data structures 

associative caches and associative circuits as accelerators for large 
databases. Faudemay, Pascal, + , M-M Dec 91 22-34 

Database machines 

associative caches and associative circuits as accelerators for large 
databases. Faudemay, Pascal, +, M-M Dec 91 22-34 
database machines (special issue). M-M Dec 91 6-76 
multibackend database supercomputer interconnected by LAN; 

architecture and performance. Hsiao, David K., M-M Dec 91 44-60 
RINDA relational database processor with hardware specialized for 
searching and sorting. Inoue, Ushio, + , M-M Dec 91 61-70 
Database management systems; cf. Database systems, searching 
Database systems 

database system copyrights (Micro Law). Stern, Richard H., M-M Dec 
91 77-79 

Database systems; cf. Distributed database systems 
Database systems, relational 

fine grain architecture for relational database aggregation. 

Abdelguerfi, M., + , M-M Dec 91 35-43 
RINDA relational database processor with hardware specialized for 
searching and sorting. Inoue, Ushio, + , M-M Dec 91 61-70 
VLSI accelerators for improving large database system performance. 
Lee, Kuo Chu, + , M-M Dec 91 8-20 
Database systems, searching 

RINDA relational database processor with hardware specialized for 
searching and sorting. Inoue, Ushio, + , M-M Dec 91 61-70 
VLSI accelerators for improving large database system performance. 
Lee, Kuo Chu, + , M-M Dec 91 8-20 
Design automation; cf. Circuit simulation; Design automation software 
Design automation software 

shielded twisted-pair cable characteristics evaluation using Math CAD 
(On the Edge). Warren, Carl, M-M Feb 91 46-47 

Digital control 

configurable, virtual microprocessor system to simulate and validate 
process plant designs in classroom. Russell, David W., + , M-M Feb 
91 26-29 

Digital integrated circuits; cf. Very-large-scale integration 
Digital signal processors; cf. Microprocessors 
Disk recording; cf. Optical memories 

Distributed computing 

iWarp microprocessor supporting message-passing and systolic 
communications for multicomputers. Peterson, Craig, + , M-M Jun 
91 26-29, 81-87 

Distributed computing; cf. Distributed database systems; 

Multiprocessing 

Distributed database systems 

Alphorn remote procedure call environment for fault-tolerant,
heterogeneous, distributed systems. Aschmann, Hans-Ruedi, + , 
M-M Oct 91 16-19,60-67 

multibackend database supercomputer interconnected by LAN; 
architecture and performance. Hsiao, David K., M-M Dec 91 44-60 


E 

Education 

microprocessors in education (special issue). M-M Feb 91 14-83 
Education; cf. Computer science education 

Educational technology; cf. Computer science education; Visualization 
Electrical engineering education; cf. Control engineering education 
EMG (electromyography); cf. Muscles, EMG 
England; cf. United Kingdom 

Europe 

use of English units in European hardware and software systems 
(Micro World). Kirrmann, Hubert, M-M Aug 91 36-39 
Expert systems

KMDS expert system for integrated hardware/software design of 
microprocessor-based digital systems. Kuo, Yau-Hwang, + , M-M 
Aug 91 32-35, 86-92 


F 


Fault tolerance; cf. Redundant systems 
Forecasting; cf. Technology forecasting 

France 

project management, site, track, and mobile computers used in 
construction of England-to-France tunnel (Micro World). Kirrmann, 
Hubert, M-M Jun 91 4-6 

Fuzzy set theory 

fuzzy set theory applications and advances in Japan (Software Report). 
Klir, George J., M-M Aug 91 8-11


H 


History 

microcomputer developments during past ten years (Micro World). 

Kirrmann, Hubert, M-M Feb 91 42-44 
tenth anniversary retrospective. Seaborn, True, + , M-M Feb 91 5-9, 
61 

Holographic memories 

3-D optical architecture. Louri, Ahmed, M-M Apr 91 24-27, 65-82 
Human factors; cf. Cognitive science 


IEEE Micro 

tenth anniversary retrospective. Seaborn, True, + , M-M Feb 91 5-9, 
61 

IEEE standards

developments in several draft standards (Micro Standards). Warren, 
Carl, M-M Aug 91 40-41 

IEEE open microprocessor architecture standard P1754 (On the Edge).
Warren, Carl, M-M Oct 91 30-33 

IEEE P1175/D11 draft standard on computing system tool 
interconnections overview (Micro Standards). Warren, Carl, M-M 
Dec 91 83-85 

IEEE P996 draft standard for AT-bus; development (Micro Standards). 
Warren, Carl, M-M Feb 91 45, 56 

real-time computing with IEEE Futurebus+. Sha, Lui, + , M-M Jun 91 
30-33, 95-100 

Information systems; cf. Database systems 
Instrumentation; cf. Measurement 

Integrated circuits; cf. Application-specific integrated circuits; 
Microprocessors; Very-large-scale integration 


J 


Japan 

Far East (special issue). M-M Aug 91 12-31 

fuzzy set theory applications and advances in Japan (Software Report). 
Klir, George J., M-M Aug 91 8-11


K 

Knowledge-based systems; cf. Expert systems 


L 

LAN; cf. Local area networks 
Languages; cf. Computer languages 
Learning systems; cf. Neural networks 
Legal factors 

Brooktree Corp. vs. Advanced Micro Devices, Inc.; issues arising from 
first chip-layout copying case (Micro Law). Stern, Richard H., M-M 
Aug 91 3-6, 94 

Legal factors; cf. Copyright protection; Software protection 

Local area networks 

multibackend database supercomputer interconnected by LAN; 
architecture and performance. Hsiao, David K., M-M Dec 91 44-60 


M 

Machine vision 

2D analog VLSI network simulating function of human visual, 
peripheral processes. Li, Hua, + , M-M Oct 91 8-11, 44-47 
Management; cf. Project management 
Mass memories 

trends in hardware and applications of microcomputers for next ten 
years. Myers, Ware, M-M Feb 91 10-13, 68-74 

Measurement 

use of English units in European hardware and software systems 
(Micro World). Kirrmann, Hubert, M-M Aug 91 36-39 
Measurement-system data handling 

Futurebus interface design using off-the-shelf parts for graph reduction 
in parallel (GRIP) system. Peyton Jones, Simon L., + , M-M Feb 91 
38-41,84-93 

Measurement-systems data handling; cf. Data buses 
Memories; cf. Associative memories; Cache memories; Holographic 
memories; Mass memories; Microcomputer memories; Optical 
memories 

Memory management; cf. Protocols, memory 
Mental models; cf. Cognitive science 
Message switching 

interprocessor communication performance evaluation of five types of 
hypercube and grid-topology multicomputers. Zhang, Xiaodong, 
M-M Apr 91 12-15, 52-55f 

Microcomputer applications 

trends in hardware and applications of microcomputers for next ten 
years. Myers, Ware, M-M Feb 91 10-13, 68-74 

Microcomputer interfaces 

microcomputer-interfacing laboratory projects. Fulcher, John A., M-M 
Feb 91 18-21, 75-78 

Microcomputer memories 

trends in hardware and applications of microcomputers for next ten 
years. Myers, Ware, M-M Feb 91 10-13, 68-74 
Microcomputer networks; cf. Distributed computing; Multiprocessing 

Microcomputer performance 

IBM RISC System/6000 architecture and performance; benchmark 
details. Oehler, Richard R., + , M-M Jun 91 14-17, 56-62 

redesign of Clipper C400 RISC architecture. Sachs, Howard G., + , 
M-M Jun 91 18-21,74-80 

Microcomputer peripherals 

teaching methods for peripheral hardware class and hands-on 
multitasking lab. Schultz, Thomas W., M-M Feb 91 30-33, 80-82 
Microcomputer peripherals; cf. Microcomputer interfaces 
Microcomputer software 

3-D optical architecture. Louri, Ahmed, M-M Apr 91 24-27, 65-82 

Microcomputers 

microcomputer developments during past ten years (Micro World). 
Kirrmann, Hubert, M-M Feb 91 42-44 
Microcomputers; cf. Microprocessors
Microprocessor applications; cf. Specific topic 
Microprocessors 

book review; Intel’s Official Guide to 386 Computing (Edelhart, M.; 

1991). Mateosian, Richard, M-M Jun 91 50-51 
Datawave single-chip multiprocessor for video applications. Schmidt, 
Ulrich, + , M-M Jun 91 22-25, 88-94 
DSP-based Petri-net simulation tool. Di Stefano, Antonella, + , M-M 
Apr 91 20-23, 56-64 

evolving architectures for microprocessors; overview. Slater, Michael, 
M-M Feb 91 96-95 

Gmicro/100 32-b microprocessor based on TRON specifications. 

Yoshida, Toyohiko, + , M-M Aug 91 20-23, 62-72 
Hot Chips II: Sweetening the pot (special issue). M-M Jun 91 8-29 
i860 microprocessor design and performance. Atkins, Mark, M-M Oct 
91 24-27, 72-78 

IBM RISC System/6000 architecture and performance; benchmark 
details. Oehler, Richard R., + , M-M Jun 91 14-17, 56-62 
IEEE open microprocessor architecture standard P1754 (On the Edge).
Warren, Carl, M-M Oct 91 30-33 

iWarp microprocessor supporting message-passing and systolic 
communications for multicomputers. Peterson, Craig, + , M-M Jun 
91 26-29, 81-87 

Metaflow architecture using deferred scheduling and 
register-renaming instruction shelf to manage out-of-order 
execution. Popescu, Val, + , M-M Jun 91 10-13, 63-73 
redesign of Clipper C400 RISC architecture. Sachs, Howard G., + , 
M-M Jun 91 18-21,74-80 

RISC processor for embedded applications within ASIC. Roberts, 
Charles E., M-M Oct 91 20-23, 68-72 
trends in hardware and applications of microcomputers for next ten 
years. Myers, Ware, M-M Feb 91 10-13, 68-74 
Microprocessors; cf. Microcomputers; Multiprocessing... 

Modeling; cf. Petri nets; Simulation 
Multiprocessing 

3-D optical architecture. Louri, Ahmed, M-M Apr 91 24-27, 65-82 
DSP-based Petri-net simulation tool. Di Stefano, Antonella, + , M-M 
Apr 91 20-23, 56-64 

Futurebus interface design using off-the-shelf parts for graph reduction 
in parallel (GRIP) system. Peyton Jones, Simon L., + , M-M Feb 91 
38-41, 84-93 

ITRON-MP adaptive real-time kernel specification for 
shared-memory multiprocessor systems. Takada, Hiroaki, + , M-M 
Aug 91 24-27, 78-85 

KRPP and DSNS superscalar architectures using task and instruction 
level parallelism. Fukuda, Akira, + , M-M Aug 91 16-19, 50-61 
RST cache memory design for tightly coupled Clipper-based 
multiprocessor system. Prete, Cosimo A., M-M Apr 91 16-19, 40-52 
Sure System 2000 fault-tolerant computer using hardware and 
software local redundancies. Kabemoto, Akira, + , M-M Aug 91 
28-31,73-78 

Multiprocessing; cf. Array processing; Distributed computing 
Multiprocessing, interconnection 

interprocessor communication performance evaluation of five types of 
hypercube and grid-topology multicomputers. Zhang, Xiaodong, 
M-M Apr 91 12-15, 52-55 

Multiprocessing, interconnection; cf. Local area networks 
Multiprocessors 

ITRON-MP adaptive real-time kernel specification for 
shared-memory multiprocessor systems. Takada, Hiroaki, + , M-M 
Aug 91 24-27, 78-85 

Multitasking 

teaching methods for peripheral hardware class and hands-on 
multitasking lab. Schultz, Thomas W., M-M Feb 91 30-33, 80-82 

Muscles, EMG 

myoelectric signal analysis using computer systems and signal 
processing techniques. Knaflitz, Marco, +, M-M Oct 91 12-15,48-58 


N 

Networks; cf. Multiprocessing, interconnection; Neural networks; Petri 
nets 

Neural networks 

optical computing overview; relation to neural networks (Software 
Report). Kahaner, David K., M-M Feb 91 53-56 


O 

Optical computing 

optical computing overview; relation to neural networks (Software 
Report). Kahaner, David K., M-M Feb 91 53-56 
trends in hardware and applications of microcomputers for next ten 
years. Myers, Ware, M-M Feb 91 10-13, 68-74 

Optical memories 

3-D optical architecture. Louri, Ahmed, M-M Apr 91 24-27, 65-82 


P 

Packet switching 

Futurebus interface design using off-the-shelf parts for graph reduction 
in parallel (GRIP) system. Peyton Jones, Simon L., + , M-M Feb 91 
38-41,84-93 

Parallel processing 

book review; Superscalar Microprocessor Design (Johnson, M.). 
Mateosian, Richard, M-M Jun 91 51, 102 

IBM RISC System/6000 architecture and performance; benchmark 
details. Oehler, Richard R., + , M-M Jun 91 14-17, 56-62 

Metaflow architecture using deferred scheduling and 
register-renaming instruction shelf to manage out-of-order 
execution. Popescu, Val, + , M-M Jun 91 10-13, 63-73 

rate monotonic scheduling algorithm applied to programming 
real-time systems (On the Edge). Warren, Carl, M-M Jun 91 34-38, 
102 

Parallel processing; cf. Multiprocessing; Pipeline processing 

Parallel processing, interconnection; cf. Multiprocessing, 
interconnection 

Patents; cf. Software protection 

Personal computers; cf. Microcomputers 

Petri nets 

DSP-based Petri-net simulation tool. Di Stefano, Antonella, + , M-M 
Apr 91 20-23, 56-64 

Pipeline processing 

fine grain architecture for relational database aggregation. 
Abdelguerf, M., + , M-M Dec 91 35-43 

redesign of Clipper C400 RISC architecture. Sachs, Howard G., + , 
M-M Jun 91 18-21,74-80 

Project management 

project management, site, track, and mobile computers used in 
construction of England-to-France tunnel (Micro World). Kirrmann, 
Hubert, M-M Jun 91 4-6 

Protocols 

Futurebus interface design using off-the-shelf parts for graph reduction 
in parallel (GRIP) system. Peyton Jones, Simon L., + , M-M Feb 91 
38-41,84-93 

Protocols, memory 

RST cache memory design for tightly coupled Clipper-based 
multiprocessor system. Prete, Cosimo A., M-M Apr 91 16-19, 40-52 

Publishing; cf. Copyright protection 


R 

Real-time systems 

ITRON-MP adaptive real-time kernel specification for 
shared-memory multiprocessor systems. Takada, Hiroaki, + , M-M 
Aug 91 24-27, 78-85 

rate monotonic scheduling algorithm applied to programming 
real-time systems (On the Edge). Warren, Carl, M-M Jun 91 34-38, 
102 
real-time computing with IEEE Futurebus+. Sha, Lui, + , M-M Jun 91 
30-33, 95-100 

Redundant systems 

Sure System 2000 fault-tolerant computer using hardware and 
software local redundancies. Kabemoto, Akira, + , M-M Aug 91 
28-31,73-78 

Routing; cf. Multiprocessing, interconnection 


S 


Scheduling 

Metaflow architecture using deferred scheduling and 
register-renaming instruction shelf to manage out-of-order 
execution. Popescu, Val, + , M-M Jun 91 10-13, 63-73 
rate monotonic scheduling algorithm applied to programming 
real-time systems (On the Edge). Warren, Carl, M-M Jun 91 34-38, 
102 

Search methods; cf. Database systems, searching 
Semicustom integrated circuits; cf. Application-specific integrated 
circuits 

Set theory; cf. Fuzzy set theory 
Shielding; cf. Wire communication cable shielding 
Signal analysis; cf. Biomedical signal analysis 
Signal processing; cf. Video signal processing 

Simulation 

configurable, virtual microprocessor system to simulate and validate 
process plant designs in classroom. Russell, David W., + , M-M Feb 
91 26-29 

DSP-based Petri-net simulation tool. Di Stefano, Antonella, + , M-M 
Apr 91 20-23, 56-64 
Simulation; cf. Circuit simulation 

Software; cf. Computer languages; Design automation software; 

Microcomputer software; Multitasking 
Software, utility programs; cf. Computer interfaces 
Software design/development; cf. Data structures 
Software education; cf. Computer science education 
Software protection 

Ashton-Tate Corp. vs. Fox Software, Inc; applicability of patent and 
copyright laws to software (Micro Law). Stern, Richard H., M-M Jun 
91 42-46 

Lotus Development Corp. vs. Paperback Software International; 
court’s failure to recognize functionality of standardization (Micro 
Law). Stern, Richard H., M-M Feb 91 48-51 
Lotus Development Corp. vs. Paperback Software International; 
effects of court’s decision on software industry (Micro Law). Stern, 
Richard H., M-M Apr 91 30-33 
Software standards 

Lotus Development Corp. vs. Paperback Software International; 
court’s failure to recognize functionality of standardization (Micro 
Law). Stern, Richard H., M-M Feb 91 48-51 
Lotus Development Corp. vs. Paperback Software International; 
effects of court’s decision on software industry (Micro Law). Stern, 
Richard H., M-M Apr 91 30-33 
Sorting/merging; cf. Database systems, relational 
Special issues/sections 

database machines. M-M Dec 91 6-76 
Far East. M-M Aug 91 12-31 
Hot Chips II: Sweetening the pot. M-M Jun 91 8-29 
microprocessors in education. M-M Feb 91 14-83 
Standards 

difficulties arising from diversity of methods for obtaining benchmarks 
(Micro Standards). Warren, Carl, M-M Apr 91 34-35 
Futurebus interface design using off-the-shelf parts for graph reduction 
in parallel (GRIP) system. Peyton Jones, Simon L., + , M-M Feb 91 
38-41, 84-93 


Standards; cf. IEEE standards; Software standards 


T 


Taiwan 

growth of computer industry in Taiwan (Software Report). Kahaner, 
David K., M-M Jun 91 39-41 

Technology forecasting 

trends in hardware and applications of microcomputers for next ten 
years. Myers, Ware, M-M Feb 91 10-13, 68-74 

Terminology 

book review; Computer Dictionary. Mateosian, Richard, M-M Aug 91 
43 

TV; cf. Video signal processing 

Twisted-pair cables 

shielded twisted-pair cable characteristics evaluation using Math CAD 
(On the Edge). Warren, Carl, M-M Feb 91 46-47 


U 


Uncertain systems; cf. Fuzzy set theory 

United Kingdom 

project management, site, track, and mobile computers used in 
construction of England-to-France tunnel (Micro World). Kirrmann, 
Hubert, M-M Jun 91 4-6 


V 


Very-large-scale integration 

2D analog VLSI network simulating function of human visual 
peripheral processes. Li, Hua, + , M-M Oct 91 8-11, 44-47 
Gmicro/100 32-b microprocessor based on TRON specifications. 

Yoshida, Toyohiko, + , M-M Aug 91 20-23, 62-72 
VLSI accelerators for improving large database system performance. 
Lee, Kuo Chu, + , M-M Dec 91 8-20 
Video signal processing 

Datawave single-chip multiprocessor for video applications. Schmidt, 
Ulrich, + , M-M Jun 91 22-25, 88-94 

Virtual computers 

configurable, virtual microprocessor system to simulate and validate 
process plant designs in classroom. Russell, David W., + , M-M Feb 
91 26-29 

Vision systems (nonbiological); cf. Machine vision 
Visual languages 

book review; PostScript Language Reference Manual, 2nd edn. 
(Systems, A.; 1990). Mateosian, Richard, M-M Apr 91 28-29 

Visualization 

configurable, virtual microprocessor system to simulate and validate 
process plant designs in classroom. Russell, David W., + , M-M Feb 
91 26-29 

VLSI; cf. Very-large-scale integration 


W 

Wire communication cable shielding 

shielded twisted-pair cable characteristics evaluation using Math CAD 
(On the Edge). Warren, Carl, M-M Feb 91 46-47 
Micro Law 


Richard H. Stern 

Law Offices of 
Richard H. Stern 
1300 19th Street NW, 
Suite 400 

Washington, DC 20036 


Database system copyrights 


Databases, or more generally digital information, differ in important 
ways from the kind of information traditionally protected under 
copyright law. Protecting databases by copyright law impinges on 
different interests of plaintiffs and defendants than does protecting 
books, pictures, and songs, all traditional subjects of copyright law. 
A sensible legal policy may therefore call for different trade-offs 
among competing demands of owners, competitors, and users to maximize 
the benefits to the public (or minimize the public harm) resulting 
from the operation of the legal system. 

Differences from 
traditional subjects 

Commentators¹ have pointed out ways in which databases differ from the 
traditional subject matter of copyright law. 

• Digital information is often more easily and cheaply copied, making 
appropriation of such information easier and more profitable in the 
short run than appropriation of traditional works. 

• Copyright law regards as significant the distinctions among literary 
works (words), audiovisual works (pictures, graphics, changing 
sequences of imagery), and musical works (sound). But these 
distinctions blur for digital information. It is all bits or bytes 
stored on tape, disk, or CD-ROM, no matter how you slice it. Why treat 
use of some bits differently? 

• Books can be read by eye. Pictures are also visually perceived. 
Music or other sound can simply be listened to and apprehended by ear. 
Most users come equipped with these hardware (or wetware) devices. 
Databases cannot be used without appropriate special-purpose hardware 
and software. 

• Finally, digital information is more easily modified into another 
form. You cannot readily turn the Mona Lisa into something else. But 
any time you access a database you can rearrange the downloaded 
material into almost any desired format. Users can and do readily 
create new things with old material, making new works that may be 
eligible for copyright protection. The Mona Lisa is in the public 
domain. So is Beethoven’s Eroica. With a database and appropriate 
hardware and software, a user can create a new Eroic-Mona pastiche. 

Another aspect of database systems, as they are now developing, defies 
even comparison with the traditional subject matter of copyright. This 
is the issue of what’s the data and what’s the program. I do not mean 
that you cannot inspect bits in a CD-ROM and figure out which bits 
comprise data and which bits comprise instructions that manipulate the 
data. You may be able to do this, if someone has provided a debugger 
or disassembler or reverse compiler; otherwise, probably not. But so 
what? 

Who owns the data? 

The problem I mean is that a database user gets results by using a 
legally protectable computer program to manipulate data that is 
largely unprotectable in itself. This is because the data is old or 
simply was not created or owned by the proprietor of the database. 
Sweat equity in collecting and storing a mass of factual data, such as 
names and addresses of people who have telephone numbers in a town, is 
not legally protected under copyright law except for original aspects 
of selection and arrangement, if any. (The Supreme Court recently so 
held in a decision that a telephone company held no copyright on its 
white pages directory, because there was no authorial originality in 
the content.²) 

The economic value of the database, and of any product you can create 
from it, comes from both the data and the software for accessing and 
organizing it. Yet the resulting product does not visibly contain the 
software. It just contains the data, in a differently manipulated 
format. The result is like a compiled program that does not include 
any “runtime modules.” (Runtime modules are units of code written for 
a compiler package and incorporated into compiled code as part of 
compilation of source code.) Traditional copyright doctrine would 
suggest there is no computer program in the result and thus nothing 
legally protected. But that result may not make the best sense from a 
business standpoint or in terms of encouraging desired forms of 
enterprise. Opinions will differ. 

You might take that example further, in terms of user interfaces. One 
of the things that makes a database easy to use is a well-designed 
user interface. In a sense, the interface contributes to the ease of 
making new products from a database and the quality level they reach, 
but the new product does not visibly contain the interface itself. 
Some might consider the resulting legal implications a cause for 
regret. Others will applaud. 

Blurred lines of authorship 

Generally speaking, copyright law protects only results that embody 
authorial originality: some kind of creative spark by a person 
claiming to have done the creating. This legal principle impinges on 
databases in two ways. First, the proprietor of the database system, 
including whatever software is involved, cannot claim to be the 
creator of what some other person (a user) brings into existence by 
using the system. Building a soldering iron does not make you the 
creator of a breadboard whose components someone else solders 
together. 

Second, there may be a question about whether even the person who uses 
the database system to produce something is an author of an original 
work. The person may not contribute enough, either. This may be a 
situation in which authorship falls into the bit bucket. 

The same kind of question has been raised for hypertext linkages for 
database entries. A hypertext linkage is a facility like a footnote. 
If this text (for example, the preceding sentence) were on your screen 
because it belonged to a database that you were using (who knows, IEEE 
Micro on line for hypertext may be just around the corner), you could 
mouse click on the word “hypertext” (or otherwise enter it). A window 
would appear, showing a definition of hypertext. Or you could click on 
the word “footnote” and a window would pop up, explaining what a 
footnote is. That would occur, however, only if someone went over this 
text and created a set of hypertext linkages. The linkages would 
invoke routines that make the program controlling display of this text 
jump to an entry elsewhere in the database system, to stored text 
describing hypertext, footnotes, or whatever. 

Creating such hypertext linkages is said to be a form of scholarship. 
Doubt has been expressed, however, that such scholarship results in 
any copyrightable work of authorship. If that means that no effective 
reward system will encourage creation of hypertext linkages, perhaps 
there is a social problem. 

On the other hand, creation of some hypertext linkages for my text 
might offend me. That, too, could cause a social problem. Suppose that 
someone links this text to a database dictionary entry to exemplify 
the term “GIGO.” Perhaps I would not want my work linked in that 
manner. Should authors have control over hypertext linkages to or from 
their work? Under traditional copyright law principles, I can prevent 
my work from being reproduced without permission (subject to fair-use 
limitations on my rights). Hypertext linkages may dilute that right. 
Is there a copyright infringement that I can suppress by law when 
someone makes a hypertext linkage? Probably not. There is no 
unauthorized reproduction of my work (once it has lawfully been placed 
into the database), public performance of it, or other recognized 
category of copyright infringement. 

Those are some of the problems we can anticipate when the courts try 
to apply copyright law to database systems. It is the same kind of 
problem you run into when you try to use a set of legal categories 
developed for one purpose as a mechanism for regulating something else 
that somebody says ought to be regarded as the same thing and 
therefore regulated in the same way. What is ever the same as anything 
else that went before? That has been a problem since the Greeks looked 
at rivers or tried to address convergence toward a limit. (See box on 
next page.) 

Sometimes this legal process works, because the two things are enough 
the same for the purpose at hand, or are sufficiently the same 
considering how much (or little) money you want to spend to resolve 
the issue. Other times, the process works dismally, because the new 
subject matter is so vastly different from the old that the legal 
results do not pass the well-known subjective tests. (These tests have 
various names, such as the “straight face” test or the “barf” test. 
The question is whether the judgments that courts reach using the 
legal theory in question are too ridiculous.) 

It would be premature to pass judgment now on whether applying 
copyright law to the use of database systems will pass or fail these 
tests. The real-world chickens haven’t come home to roost, yet. 
When are two things the same? 

Heracleitus was one of the first to address this issue when he asked 
if one ever steps twice into the same river. Zeno then came up with a 
misguided way of addressing a series of the form 
t = (s/v)(1 + 0.1 + 0.01 + ...). He believed there was no answer, that 
is, no knowable value for t, simply because the series was infinite. 

Zeno argued that Achilles, running 10 times as fast as a tortoise 
creeps, could never catch up to it. Each time Achilles closed a gap 
ds, the tortoise advanced another increment of 0.1 ds, and thus would 
forever stay ahead. (If Zeno believed that, he would have been willing 
to believe a great many other things.) 
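
The series in Zeno's argument does have a finite sum, and a short worked calculation shows it. This is only a sketch: it assumes (the box does not define the symbols) that s is the tortoise's head start and v is Achilles's speed, so that s/v is the time Achilles needs to cover the original gap.

```latex
t = \frac{s}{v}\left(1 + 0.1 + 0.01 + \cdots\right)
  = \frac{s}{v}\sum_{k=0}^{\infty} (0.1)^{k}
  = \frac{s}{v}\cdot\frac{1}{1 - 0.1}
  = \frac{10}{9}\,\frac{s}{v}
```

Under those assumptions the result agrees with the direct calculation: the gap closes at a rate of v - 0.1v = 0.9v, so Achilles draws level after s/(0.9v).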

This is much like the logic of copyright law and indeed legal logic in 
general. Which parameters you designate as independent variables 
determine the outcome. If you decide to talk about F(g), the plaintiff 
wins; if you decide to talk about G(f), the defendant wins. It’s like 
integrating by parts, where d(uv) = u dv + v du. What you decide to 
designate u and what you decide to designate v (or dv) determines 
whether you will solve the problem or just make it worse. 
SBus: an open bus architecture 


[In the third part of our series on tools, Rudolf Usselmann discusses 
the SBus and its progress towards IEEE standardization. 

I invite readers to send information on a tool or method that solves 
problems, for consideration in future columns. -C.W.] 


Rudolf Usselmann 
Sparc International 

Sun Microsystems originally developed the SBus for the Sparcstation 1 
as an inexpensive, high-performance system interconnection bus. “The 
original goal of the SBus was to provide an expansion bus for free, as 
our cost target for the Sparcstation 1 was zero dollars,” said Jim 
Ludemann of Antares Microsystems. Ludemann was one of the original 
codesigners of Sparcstation 1 and SBus. 

“This meant we had to design a bus flexible enough for I/O, but fast 
enough to connect main memory with the cache,” Ludemann said. “The 
SBus was literally the only bus in Sparcstation 1. We knew that in 
future system implementations the SBus would be used only for I/O, as 
system design considerations require a custom memory bus for each new 
system. Furthermore, SBus was designed from the start to be 
implemented with CMOS gate arrays, giving us low cost, high 
performance, and low power usage.” 

As we can see in many implementations of SBus, the company followed 
its plans precisely. Even clone manufacturers followed these steps and 
implemented SBus in their systems. A good example is the Toshiba 
Laptop, which has one SBus slot for extensions. 

Features 

SBus offers system and peripheral developers a combination of 
advantages that other buses cannot offer as a package. Individually, 
its features are easily achieved and are incorporated in other buses. 
As a combined package, they offer a very good solution for modular 
I/O subsystem extensions. (See Table 1 on page 81 for detailed signal 
definitions.) 

Low device count. Various companies offer one-chip solutions for SBus 
developers. These chips integrate the complete SBus interface on one 
IC. Peripheral developers have an easy task connecting standard I/O 
devices (serial I/O, SCSI, and LAN controllers) to single chips and 
don’t have to worry about correct SBus implementation. System 
developers can purchase interface chips that allow communication from 
high-speed CPU interconnection buses (for example, MBus) to SBus, thus 
simplifying the task of system development dramatically. 

Low-power CMOS implementation. SBus is 100 percent compatible with 
CMOS technology, which eliminates the need for high current drivers 
and large power supplies, as well as problems with heat dissipation. 

Flexible high performance. The original implementation of SBus allows 
data transfer rates up to 80 Mbytes/s. After the company released SBus 
to the public domain, users formed an SBus working group and started 
working on an extension. 

The SBus public committee developed an extension for SBus to allow 
64-bit operations. This meant defining new transfer types and 
introducing modes without breaking current implementations. It was not 
a simple task. 

The extension of SBus to 64 bits increases performance to more than 
160 Mbytes/s. This will help I/O performance keep up with 
ever-increasing CPU performance. In the past, many I/O devices could 
not connect to an I/O bus because of their need for high throughput. 
The extensions will allow these devices to connect to SBus. However, 
SBus sizing does not limit one to a certain transfer rate. SBus 
dynamic bus sizing and a wide variety of cycle types can fit many 
needs. 
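
The transfer-size encodings listed in Table 1 can be turned into a small lookup, which also makes the 32-bit versus 64-bit comparison concrete. The sketch below is only one reading of that table: the function names are hypothetical, and the bandwidth helper assumes the number of transfers per second stays the same when the data path widens, which the article does not state.

```python
# Siz(2:0) encodings for standard (32-bit) mode, in bytes, as listed in Table 1.
# None marks the code that announces an extended transfer.
STANDARD_SIZES = {
    0b000: 4,     # word
    0b001: 1,     # byte
    0b010: 2,     # half-word
    0b011: None,  # extended transfer: size comes from ETSiz(2:0)
    0b100: 16,    # four-word burst
    0b101: 32,    # eight-word burst
    0b110: 64,    # sixteen-word burst
    0b111: 8,     # two-word burst
}

# ETSiz(2:0) encodings for extended (64-bit) mode; 000-010 are reserved.
EXTENDED_SIZES = {0b011: 8, 0b100: 16, 0b101: 32, 0b110: 64, 0b111: 128}


def transfer_bytes(siz, etsiz=None):
    """Return the number of bytes moved for a given size code."""
    if siz not in STANDARD_SIZES:
        raise ValueError(f"Siz must be a 3-bit value, got {siz!r}")
    size = STANDARD_SIZES[siz]
    if size is not None:
        return size
    if etsiz not in EXTENDED_SIZES:
        raise ValueError(f"reserved or missing ETSiz value: {etsiz!r}")
    return EXTENDED_SIZES[etsiz]


def scaled_rate(rate_32bit_mbytes, width_bits):
    """Scale a 32-bit transfer rate to another width (assumes equal transfers/s)."""
    return rate_32bit_mbytes * width_bits / 32


print(transfer_bytes(0b110))         # sixteen-word burst in 32-bit mode
print(transfer_bytes(0b011, 0b111))  # largest extended transfer
print(scaled_rate(80, 64))           # doubling the width doubles the rate
```

Doubling the data path from 32 to 64 bits at an unchanged transfer rate accounts for the step from 80 to 160 Mbytes/s; the article's "more than 160" suggests the extended burst sizes contribute as well.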

Open Boot. Open Boot firmware supports the SBus and assists software 
drivers. Each SBus board has a small ROM containing identification 
information and optional boot and diagnostic drivers. The 
identification information includes such generic items as the device 
name, register addresses, and interrupt levels and may include 
device-specific items such as the resolution of a display device. 

This ROM information is stored as a computer program written in 
F Code, a byte-coded version of the Forth programming language. F Code 
is independent of the CPU’s instruction set architecture, so the SBus 
device’s F Code ROM can be used without change on systems with 
different CPU types. 

The Open Boot firmware residing on 



Figure 1. Low-cost (a) and high-performance (b) systems. 


Contacts 

SBus study group 

e-mail: sbus_sg@sparc.com 
(Mail goes to 150 members) 

Wayne Fischer 
Chair, SBus study group 
Force Computers 
3165 Winchester Blvd. 
Campbell, CA 95008 
phone: (408) 370-6300 
fax: (408) 374-1146 
e-mail: uunetiforce-dwfischer 

SBus study group alias 
maintenance and requests 

e-mail: sbus_sgreq@sparc.com 

SBus Specifications 

Hamilton Avnet Electronics 
(800) 442-6458 

Sun’s Interoperability Center 

Yatin Trivedi 

2550 Garcia Ave., PALI-106 
Mountain View, CA 94043 
(415) 336-1812 
e-mail: trivedi@sun.com 


the CPU board contains an interpreter for the F Code language. 
Interpretation of the F Code ROM creates an entry for the device in 
the firmware’s device tree, a hierarchical data structure describing 
the system configuration. The operating system software inspects this 
device tree to determine which devices are present in the system and 
the individual characteristics of those devices. 
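
The flow described above (a CPU-independent byte code is interpreted, and the result is a node in a hierarchical device tree that the operating system later walks) can be illustrated with a toy model. Everything in this sketch is hypothetical: the three opcodes, the string encoding, and the `DeviceNode` class are invented for illustration and are not the real F Code token set or the Open Boot data structures.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceNode:
    """One entry in a toy device tree."""
    name: str
    properties: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, child):
        self.children.append(child)
        return child

    def walk(self, depth=0):
        """Yield (depth, node) pairs, parents before children."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Invented byte codes, NOT real F Code tokens.
OP_NAME, OP_PROP, OP_END = 0x01, 0x02, 0xFF


def read_string(rom, pos):
    """Read a length-prefixed ASCII string; return it and the next position."""
    length = rom[pos]
    text = rom[pos + 1:pos + 1 + length].decode("ascii")
    return text, pos + 1 + length


def probe(rom, parent):
    """Interpret a board's ROM image and attach the resulting node to parent."""
    node = DeviceNode(name="unknown")
    pos = 0
    while pos < len(rom):
        op = rom[pos]
        pos += 1
        if op == OP_NAME:
            node.name, pos = read_string(rom, pos)
        elif op == OP_PROP:
            key, pos = read_string(rom, pos)
            node.properties[key] = rom[pos]  # one-byte value in this toy format
            pos += 1
        elif op == OP_END:
            break
        else:
            raise ValueError(f"unknown byte code {op:#x}")
    return parent.add_child(node)


def encode_string(text):
    data = text.encode("ascii")
    return bytes([len(data)]) + data


# A made-up ROM image for a display board with one property.
rom = (bytes([OP_NAME]) + encode_string("display")
       + bytes([OP_PROP]) + encode_string("interrupt-level") + bytes([5])
       + bytes([OP_END]))

root = DeviceNode("sbus")
probe(rom, root)
for depth, node in root.walk():
    print("  " * depth + node.name, node.properties)
```

Because the interpreter, not the host CPU, gives the ROM bytes their meaning, the same image could be probed on any machine that carries such an interpreter, which is the portability argument the article makes for F Code.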

If the F Code ROM contains diagnostic and/or boot drivers, those 
drivers may be used by the firmware for such tasks as testing the 
device hardware, loading the operating system from that device, or 
displaying start-up messages using the device. 

Open Boot’s F Code scheme is not limited to SBus. F Code is being 
considered for use with Futurebus+ and 
Table 1. Signal definitions for SBus standard 32-bit and extended 64-bit modes. All signals are distributed 
via a high-density 96-pin connector, which also includes GND, +5V, and ±12V supplies. 

Signal                   I/O   Driven by           Description 
PA(27:00) [D(59:32)]*    I     Controller          Physical address 
Sel                      I     Controller          Slave select (one per slave) 
D(31:0) [D(31:0)]*       I/O   Masters/slave       Data 
Siz(2:0) [D(62:60)]*     I/O   Masters             Transfer size 
Rd [D(63)]*              I/O   Masters             Transfer direction 
AS                       I     Controller          Address strobe 
Ack(2:0)                 I/O   Slaves/controller   Transfer acknowledgment 
LErr                     I/O   Slaves              Late data error 
BR                       O     Masters             Bus request (one per master) 
BG                       I     Controller          Bus grant (one per master) 
Clk                      I     Controller          SBus clock 
Reset                    I     Controller          Reset 
IntReq(7:1)_             O     Slaves              Interrupt request (open drain) 
DtaPar                   I/O   Master/slaves       Data parity (optional) 

In an extended mode, the signals below are time multiplexed 

Signal      Description 
D(31)       Extended transfer type 
D(30:28)    Extended transfer size (2:0) 
D(27)       Extended transfer read 
D(26:25)    Extended transfer atomic (1:0) 
D(24:0)     Extended transfer reserved (24:0) 

Supported sizes 

Siz(2:0) or   Standard (32-bit) mode,          Extended (64-bit) mode, 
ETSiz(2:0)    Siz(2:0)                         ETSiz(2:0) 
000           Word (4-byte) transfer           Reserved 
001           Byte transfer                    Reserved 
010           Half-word (2-byte) transfer      Reserved 
011           Extended transfer                8 bytes 
100           Four-word burst (16 bytes)       16 bytes 
101           Eight-word burst (32 bytes)      32 bytes 
110           Sixteen-word burst (64 bytes)    64 bytes 
111           Two-word burst (8 bytes)         128 bytes 

Acknowledgment encoding 

Ack(2:0)_   Function 
111         Idle/Wait 
110         Error acknowledgment 
101         Byte (data) acknowledgment 
100         Rerun acknowledgment 
011         Word (data) acknowledgment 
010         Double-word (data) acknowledgment 
001         Half-word (data) acknowledgment 
000         Reserved 

Atomic cycle types 

ETAtomic(1:0)   Description 
00              Normal bus cycle (non-atomic bus cycle) 
01              First bus cycle of an atomic transaction 
10              Intermediate bus cycle of an atomic transaction 
11              Last cycle of an atomic transaction 

* Signals in square brackets are for 64-bit mode. They are shared with the regular signals when that mode is enabled. 

_ Signals with an underscore mark indicate low-active signals. 



VME-D, and its design applies to other buses as well. Open Boot 
firmware is the subject of an IEEE standardization effort under 
project number P1275. 


Current status 

Sun Microsystems has placed SBus in the public domain and helped to 
establish a Public Specification Committee to continue developing the 
SBus specification. This committee is working on extensions to SBus 
and is addressing the undefined and unclear aspects of its 
specification. 

Ongoing project 

Because of the wide acceptance of the SBus standard, Wayne Fischer of 
Force Computers asked the IEEE Computer Society and the Bus 
Architecture Standards Committee to open a Project Authorization 
Request (PAR) to establish SBus as an IEEE standard. In June 1991 
IEEE’s Bus Architecture Standards Committee agreed to sponsor a study 
group. At the June meeting of the SBus Public Spec Committee, Fischer 
suggested this group merge into the SBus study group. The committee 
unanimously approved. As of publication, this group has met three 
times, most recently on December 3-4. The next meeting is scheduled 
for January 22-23, 1992, in Salt Lake City. 

Interoperability Center 

The Interoperability Center offers SBus card developers help in 
designing hardware and software and assists with porting SBus 
subsystems to the company’s different platforms. It is equipped with 
most hardware and software tools a designer might need, including 
logic analyzers, oscilloscopes, and most of the popular schematic 
editors available on the market. The center has an experienced staff 
to assist designers with problems. 

Compatibility 

Today vendors produce more than 400 different SBus boards. Thus, 
compatibility is becoming an increasingly important issue to vendors 
and end users. In summer 1991, Sparc International introduced a 
program for compatibility testing of SBus boards on Sparc-compliant 
platforms. This service verifies the compatibility of SBus boards and 
accompanying software across a wide variety of platforms. 

In a joint effort with VME Labs, Sparc International plans to expand 
this program to include full system-independent hardware and software 
verification by mid 1992. VME Labs plans to perform hardware 
verification; Sparc International plans to test device drivers and 
other accompanying software. Currently the lack of a device driver 
standard makes testing impossible. However, Sparc International and 
its members are developing a uniform standard for the next Unix 
release, SVR4. 

Summary 

Considering the typical lifetime of an I/O bus, SBus is in its infancy 
and has a lot of room to grow. If it becomes a standard, its growth 
will proceed in a controlled manner rather than in chaos, as has 
happened with other buses. Furthermore, the standardization of SBus 
will ensure its continued support and expandability. 
Computing interconnections 


The emphasis on networking puts a new burden on system designers and 
software engineers: the interchange of data in a seamless fashion. 
Many methods have been considered and have proven workable at least on 
a one-to-one basis. For example, early on Byte magazine attempted to 
provide a common format for information interchange using the Kansas 
City Standard for audiotape recording of computer data and programs. 
The system worked in the context of the machines and users at that 
time. 

Publishers of the now-defunct Interface Age magazine, of which 
incidentally I served as editor-in-chief, saw in 1977 the need for 
ensuring that common data was used in the many different processor 
types and operating systems. The solution was the IPIS (International 
Platform Interchange System). This approach consisted of a software 
process that ensured that data either conformed to a specific format 
or could be translated to that format with some manipulation on a host 
or target system. The problems with IPIS were that no groundswell of 
demand for interchange existed and that operating systems and 
languages were still being defined. Essentially we had a solution that 
was out of context with the need at the time. We had come up with an 
interesting addition to computer science, but it didn’t catch on. 

Robust systems 

Today, manufacturers deliver platforms with 
robust and well-defined operating systems and 
environments. It isn’t unusual to find a large 
network consisting of Macintoshes, IBM PCs, Sun 
and Hewlett-Packard workstations, and large 
mainframes. In some cases the ability to interchange
data is common due to the operating
environment. Unix machines, for example, provide
not only a common file transport mechanism
(File Transfer Protocol, FTP, or Network
File System, NFS) but also file and data formats
that are similar if not exactly the same.

Not all systems run Unix, and not all Unix 
boxes are implemented in the same way. Con¬ 
sequently, not all data can be guaranteed to trans¬ 
late to another platform the same way. A 
Unix-to-VAX/VMS interchange not only requires 
translating syntax and methods of naming files 
but the basic file structures as well. 
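
As a rough illustration of the file-naming part of that translation, the Python sketch below maps an absolute Unix path to a VMS-style file specification. The function name and the device name DISK1 are assumptions made for this example; the sketch ignores version numbers, name-length limits, and the file-structure differences mentioned above.

```python
# Illustrative sketch only: converts an absolute Unix path into a
# VMS-style file specification. The device name "DISK1" is an assumed
# placeholder, not something taken from the column.
def unix_to_vms(path, device="DISK1"):
    """Return a VMS-style file specification for an absolute Unix path."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise ValueError("path must name a file")
    *dirs, filename = parts
    # VMS separates directory levels with periods inside square brackets;
    # [000000] names the top-level (master file) directory.
    directory = "[" + ".".join(dirs) + "]" if dirs else "[000000]"
    # VMS file names are conventionally uppercase.
    return f"{device}:{directory}{filename}".upper()
```

Calling unix_to_vms("/usr/data/report.txt") shows how the slash-separated Unix directories collapse into a single bracketed directory field.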

To this end, work has been going on to develop
a common interconnection tool that ensures
an easy and obvious way of moving and
sharing data. 

The interconnection tool 

The IEEE effort is P1175/D11 (draft 11), A Stan¬ 
dard Reference Model for Computing System 
Tool Interconnections, prepared by the IEEE 
Computer Society’s Task Force on Professional 
Computing Tools. The balloting committee has 
approved P1175, which awaits final approval by
the IEEE Standards Board this month. For cop¬ 
ies of P1175, telephone the IEEE Standards of¬ 
fice in Piscataway, New Jersey, at (800) 678-IEEE; 
the fax number there is (908) 981-9667. For answers
to questions about the draft standard, contact
P1175 Chair Robert M. Poston, Programming
Environments, Inc., 4043 State Highway 33, 
Tinton Falls, NJ 07753; telephone (908) 918-0110 
or fax (908) 918-0113. 

The P1175 document describes a robust solution
to resolving interconnection problems called
the Semantic Transfer Language (STL). The stan¬ 
dard describes STL as parsable, easy to read and 
understand by programmers, easy to write, and 
easy to convert to a compact transfer form. The 
notion of the STL describes items in
the form of English sentences that contain
one subject, a verb or verb phrase,
and only one relation or attribute to
the subject. To maintain commonality
with existing systems and languages, STL
uses a modified version of the Backus- 
Naur Form (BNF) for syntax descrip¬ 
tion. For example, ::= means “is defined 
as,” and white spaces or tabs mean “is 
followed by.” Similarly, STL relates 
items by concepts and actions or func¬ 
tions of the concepts, such as those 
listed in the box. 
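
To make these conventions concrete, here is a minimal Python sketch. The grammar, the verb phrases, and the sample sentences are illustrative assumptions and are not taken from the P1175 draft; only the concept names come from the box of STL concepts. The sketch accepts a sentence made of one subject, one verb or verb phrase, and exactly one attribute.

```python
# Hypothetical STL-like grammar, written with the BNF conventions described
# above ("::=" means "is defined as"; white space means "is followed by").
# This grammar is an illustration only; it is NOT the P1175 syntax.
#
#   sentence    ::= concept name verb_phrase value "."
#   concept     ::= "DataItem" | "DataType" | "Action" | "State"
#   verb_phrase ::= "has type" | "is part of" | "uses"

CONCEPTS = {"DataItem", "DataType", "Action", "State"}
VERB_PHRASES = {"has type", "is part of", "uses"}  # assumed verb phrases


def parse_sentence(text):
    """Split one sentence into (concept, name, verb_phrase, value)."""
    body = text.strip()
    if not body.endswith("."):
        raise ValueError("sentence must end with a period")
    tokens = body[:-1].split()
    if len(tokens) < 4 or tokens[0] not in CONCEPTS:
        raise ValueError("expected: concept name verb-phrase value")
    concept, name, rest = tokens[0], tokens[1], tokens[2:]
    # Try the longest verb phrase first (three words down to one).
    for width in (3, 2, 1):
        verb = " ".join(rest[:width])
        # Exactly one token may follow the verb phrase: the single attribute.
        if verb in VERB_PHRASES and len(rest) == width + 1:
            return concept, name, verb, rest[width]
    raise ValueError("unknown verb phrase or more than one attribute")
```

For instance, parse_sentence("DataItem Altitude has type Real.") yields the four parts of the sentence, while a sentence carrying two attributes is rejected, which mirrors the one-relation-per-sentence rule.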

Interestingly, the overall concept of 
P1175 and STL is based on communications
concepts overlaid onto system
and software methods. Within the software
constraints, the issues of concern are:
actions (transformation of state,
control, and data); events (control, timing,
and synchronization of actions); information
and data (the primary object
of software actions); logic (logical restrictions,
or conditions, on actions);
and finally states (the context for and
evolution of actions).

Standard not limited 

The architects of P1175 haven’t limited
the standard by looking only at a
narrow range of hardware and software 
concepts. Rather, they began at the 
uppermost part—the user hierarchy, 
the organization. They thought that if 
information can’t be easily and readily 
shared by an organization, it isn’t use¬ 
ful. Moreover, the organization estab¬ 
lishes the operation context and defines 
the platform(s) that will be used. For 
example, McDonnell Douglas primarily
uses Macintosh IIcx systems and
VAX mainframes, while supporting 
subcontractors using IBM PCs and a 
variety of other platforms. 

The next issue addressed by the 
working group was tool-to-platform 
interconnections. This so-called archi¬ 
tectural context is guided by the 
organization context. When the orga¬ 
nization is closed, the process is simple, 
and platform-to-platform interconnections
work easily. However, when the
mix is great either within or external
to an organization, the problem be¬ 
comes greater, and tools must be avail¬ 
able for interconnecting platforms. The 
most logical way to solve the problem 
is to use networks that range from lo¬ 
cal area networks (LANs) to wide area 
networks (WANs). 

Once the organization and architec¬ 
tural issues and interconnections were 
worked out, the committee tackled the 
problem of interconnections among 
tools, the transfer context. How does 
data element A on platform A properly
get to platform Z over the network
and remain a useful entity? Enter
STL. The Semantic Transfer Language 
provides the mechanism to assist in the 
translation—not a trivial concept. 

STL concepts

Action
Collection
Condition
ConnectionPath
Constant
DataItem
DataKey
DataPart
DataRole
DataStore
DataType
DataView
EventItem
EventType
GraphicSymbol
Object
S_Packet
State
StateTransition

Not a lonely effort

The IEEE effort isn’t one that was
pursued without support. Indeed, many
members of the task team represent
organizations and other standards
groups that are interested in information
interchange and interconnections.
One notable effort that has been taking
place resulted from NASA’s Space
Station Freedom. The space station will 
be using a processing system to col¬ 
lect and download sensor data to earth 
stations. This system brought about the 
development of a Data Naming Stan¬ 
dard and a translation tool for ensur¬ 
ing that all the associated databases 
function in a common way. Thus the 
schema (the structure of the database) 
of one database can effectively com¬ 
municate with another database that 
has a different purpose. 

With this in mind and using communications
concepts, McDonnell Douglas
developed the Interdiscipline Systems
Definition Encyclopedia (ISDE). This
tool allows us to define database sche¬ 
mas in a common way and provides a 
translation mechanism so that any da¬ 
tabase can be made to appear to con¬ 
form to a common format. ISDE is 
implemented as a database for the 
Macintosh using 4th Dimension Data¬ 
base Manager software from Acius 
Corp. The software checks schemas 
against the McDonnell Douglas Data 
and Object Standards (DAOS) naming 
standard for conformity. 

Although McDonnell Douglas de¬ 
signed ISDE to solve a specific prob¬ 
lem associated with the space station, 
it arranged ISDE to follow the same 
rules as P1175. However, the approach
is one of a relational database rather 
than a semantic transfer language. Both 
methods have merit and should be 
combined. Therefore, I’m recommending
to the P1175 chair that the ISDE
model be considered as an addendum
to the basic P1175 work. Combining
the STL with a relational database tool 
will result in a common tool definition 
that will ensure accurate information
interchange regardless of platform or 
implementation. However, although I 
will seek a Project Authorization Request
for this work, P1175 first faces a
two-year trial-use period (beginning 
shortly) before an addendum can be 
considered. 

For additional information about the 
Data and Object Naming Standard 
(DAOS), contact C.R. Easton, McDonnell
Douglas Space Systems Co., Space
Station Div., 5301 Bolsa Ave., M/S A95- 
J845-17/6; Huntington Beach, CA 
92647; telephone (714) 896-4551; fax 
(714) 896-5034. 

For more information about ISDE, 
contact Lee Neitzel, CTA Corp., 18333 
Egret Bay Blvd., Suite 310, Houston, 
TX 77058; telephone (713) 333-2436; 
fax (713) 333-2493. 
MULTIPLE-VALUED
LOGIC IN VLSI DESIGN 

edited by Jon T. Butler 

This book provides a historical per¬ 
spective on MVL, focuses on various 
technologies for implementing multiple¬ 
valued VLSI circuits, delves into appli¬ 
cations of MVL in diverse technologies, 
and discusses device physics, logic char¬ 
acteristics, and small and medium-scale 
circuit experiences. The text contains 
12 papers divided into four categories 
—Introduction, Multiple-Valued Logic 
Technology, Special Implementations, 
and Multiple-Valued Logic Tools and 
Techniques. 
Workstations


Entry- and mid-level 
workstations 

Users who have outgrown their per¬ 
sonal computers but aren’t ready to 
move up to high-end workstations may 
want to consider Sun’s IPX and ELC
Sparcstations. 

The 16-Mbyte IPX (expandable to 64 
Mbytes) is a mid-level workstation de¬ 
signed for a variety of applications, in¬ 
cluding complex, computer-aided 
software development, financial analy¬ 
sis, electronic publishing, and network 
file/database access. It achieves 28.5 
MIPS and 4.2 Mflops with a rating of 
24.2 Specmarks. GX accelerated graph¬ 
ics speed window response time and 
allow users to manipulate multiple win¬ 
dows and complex objects in near-real 
time. IPX works as a single unit or con¬ 
figures as a file server for small work 
groups. 

The ELC yields 21 MIPS and 3 Mflops 
with 20.1 Specmarks. It offers multiple 
windows, a high-resolution display, fast 
response times, built-in networking, 
and 8 Mbytes of standard memory (ex¬ 
pandable to 16 Mbytes). It also config¬ 
ures as a file server. ELC integrates all 
of the workstation components into the 
back of a monochrome display. Sun 
Microsystems; $11,995 (IPX), $4,995 
(ELC). 
25-MHz SBus Sparc

Compstation 25 is an SBus-compat- 
ible, Sparc-compliant definition 1.1 
workstation that runs on Sparc/OS 
1.1.1. It yields 15.8 MIPS and 1.75 
Mflops with 10.25 Specmarks. The sys¬ 
tem features a high-resolution, 19-in. 
color monitor and 8 Mbytes of RAM 
(expandable to 64 Mbytes). It supports 
SunView, OpenWindows, and Motif
software. An SBus add-on card allows 
users simultaneous access to Unix and 
DOS environments. Tatung Science 
and Technology; $7,995. 
16-Mbyte Sparc CPU

The power and software capability 
of a Sun workstation running SunOS 
is available to the embedded systems 
market in a 16-Mbyte Sparc CPU. The 
Sparc CPU-1E/16 is a VME single-board 
computer that offers enough on-board 
memory to run standard SunOS in a 
single-slot environment without a 
memory expansion board. It achieves
12.5 MIPS and 1.4 Mflops with 8.4
Specmarks. Force Computers; $7,995.
Computer boards and cards


CMOS 68000 single-board 
computer 

Designed for use in remote or bat¬ 
tery-powered portable applications, the 
MPL-4079 board consumes 120 mA with
the CPU running at 8 MHz. Six 32-pin 
JEDEC sockets can be equipped with 
up to 512 Kbytes of PROM and 256 
Kbytes of battery-backed RAM on a 25- 
sq in. CMOS RAM. 

The device also has two RS-232 
serial ports, a battery-backed clock cal¬ 
endar, two 16-bit timers, and a four- 
channel, 8-bit, A/D converter. The 
board’s G-64 bus interface expands 
I/O capabilities. It operates from -40° 
to +85° C. Software for the MPL-4079 
can be developed on DOS and down¬ 
loaded to the board’s memory using 
Crosscode C and PC bridge cross 
development software packages. 
Gespac; $595 (100s). 
PCs can run Macintosh
software 

DOS users can install the Andor One 
add-in board with associated software 
to run Apple Macintosh software. The 
chip’s circuitry allows a DOS machine’s 
3.5-in. drive to directly read and write 
Macintosh disks. The full-length PC 
card installs in any PC, from the 
earliest PC-XTs through 486-based 
machines. Andor One has terminate- 
and-stay-resident software that takes 60 
Kbytes of RAM. The package also 
includes Word-for-word/Mac, a DOS- 
Mac translator. 

According to the company, Andor 
One runs twice as fast as a Macintosh 
Classic. It is compatible with DOS- 
standard peripherals such as mouse, 
keyboard, hard disk, 3.5-in. disk drives,
and monitors. An AppleTalk-compatible
RS-422 connector allows it to network
with Apple LaserWriters, Apple
LocalTalk, PhoneNet, and other networking
devices. The board requires
Mac Plus ROM and Macintosh System
and Finder software. Hydra Systems; 
$995. 
Enhanced OS-9 system

OS-9/MVME167 Development Pak is 
an optimized version of the OS-9 real-time
operating system. Its manufacturer
says it takes full advantage of the power 
of Motorola’s 32-bit M68040 processor 
while providing support for the 
advanced on-board serial, SCSI, and 
Ethernet hardware. 

The package includes a number of 
OS-9 device drivers for I/O peripher¬ 
als included on the MVME167 board 
family, including SCSI drivers for the 
NCR 53C710 controller. Other drivers 
support the on-board, real-time clock 
and the CD-2401 serial I/O controller. 
An Internet Support Package supports 
Ethernet. Also included are a set of resi¬ 
dent development tools for application 
development, including K&R C com¬ 
piler, macro assembler and linker, user 
state debugger, μMACS full-screen editor,
and shell command interpreter. The
system also comes in a modified Run¬ 
Time Pak that excludes device drivers 
and development tools. Microware Sys¬ 
tems; $3,000 (Development Pak), 
$1,500 (Run-Time Pak). 




Microware's OS-9 Development Pak 


Classic runs twice as fast 

The Classic Performer boosts the 
speed and performance of the Macin¬ 
tosh Classic by as much as 96 percent, 
its manufacturer says. The accelerator 
board is a 68000 processor that runs at 
16 MHz, twice the clock speed of the 
Classic’s built-in processor. It has a 64- 
Kbyte SRAM cache circuit and 25-ns 
PAL chips. The manufacturer says it 
speeds math calculations by up to 75 
percent and accelerates the SCSI port 
by 15 percent. An optional math
coprocessor boosts math-intensive programs
by as much as 5,000 percent. It
supports Macintosh systems above 
6.0.7. Harris Laboratories; $299.95
(Classic Performer), $149.95 (68881
math coprocessor). 
Communications software


Data compression for micro- 
to-mainframe conversion 

Version 2 of Commpress is an 
advanced bit compression/decompres¬ 
sion utility that processes text and 
binary files and offers ASCII/EBCDIC 
conversion tables. The manufacturer 
says it reduces size and transfer time 
of text files by 50 to 80 percent. 

Commpress is compatible with most 
micro-to-mainframe products with file 
transfer capabilities. Version 2 also sup¬ 
ports OS/2 and Macintosh workstations. 
It is available in VM/CMS, DOS/VSE 
CICS, and MVS (TSO and CICS) ver¬ 
sions. Telepartner International; from 
$6,000. Free upgrades for customers on 
maintenance. 


Network management 

The two products in the Network 
Control Series offer managers the abil¬ 
ity to control their enterprisewide net¬ 
works at central and local levels. 
Cornerstone Agent allows subnet man¬ 
agers to monitor traffic on the network, 
filter data, generate statistics, and 
decode protocols. It identifies poten¬ 
tial LAN problems or interfaces with 
Foundation Manager to allow a central 
administrator to troubleshoot any LAN 
on the network. Foundation Manager 
has in-depth analysis and management 
capabilities including statistical analy¬ 
sis, network characterization, simula¬ 
tion, auto-baselining, protocol analysis, 
and network visual mapping. 

Both products are based on a graphi¬ 
cally programmable user interface and 
built-in start-up programs. The series 
supports MIB I, MIB II, and Remote 
Network Monitoring MIB. It also supports
Token Ring source routing and
IP routing. Pro Tools; $8,995 (Founda¬ 
tion Manager), $1,295 (Cornerstone 
Agent). Upgrades for $500. 
VAX-to-IBM connection

VAX users can connect directly to 
IBM’s LU6.2/Token Ring with EZ
Bridge. This software allows VAX
users on PCs or VT terminals to log on 
to a mainframe as a 3270 user and 
access mainframe applications such as 
CICS and TSO. VAX users can also print 
DOS-format files on their local print¬ 
ers and have peer-to-peer communi¬ 
cations with LU6.2 systems. 

The EZ Bridge family includes three 
products. The SNA/3270 connects VAX 
to mainframe by emulating an IBM 
3174 or 3274 cluster controller, 3278/9 
terminals, and 3287 printers. The SNA/ 
LU6.2 implements IBM’s LU6.2 proto¬ 
col that provides peer-to-peer commu¬ 
nications capabilities on an SNA/Token 
Ring network. EZ Bridge OLTP allows 
users to access and update data in dif¬ 
ferent locations. Systems Strategies; 
$3,000 to $15,000, based on VAX size.
Multimedia
communications chip 

Fax, data, voice, and caller ID capa¬ 
bilities are combined on an LSI chip, 
the Fax Vodem (YTM403). The chip 
enables users to send and receive 
graphics, data, and voice messages on 
a single line and incorporate caller ID 
if desired. It supplies voice quality with 
resolution up to 12 bits at 9,600 bps 
and dynamic range up to 72 decibels. 

A built-in voice record-and-playback 
capability provides a 12-to-4 bit, three- 
to-one compression and decompres¬ 
sion ratio. An internal analog-to-digital/ 
digital-to-analog converter supports 
voice-mail capability. A caller ID function
allows receivers to identify the
sender before answering the phone or 
receiving a communication. The chip 
comes in 64-pin SDIP or 64-pin QFP 
versions. Yamaha Systems Technology; 
$40 (1,000s). 
10 Base T through AUI port
on Ethernet 

The Model 177 transceiver attaches 
to Ethernet equipment to upgrade LAN 
coaxial Ethernet adapters and wiring 
networks to the 10 Base T wiring stan¬ 
dard. It uses unshielded, twisted-pair 
wiring and connects to Ethernet 
through an AUI port or via an Ethernet 
transceiver cable. 

The unit has four status LEDs indi¬ 
cating link, collision, receive, and trans¬ 
mit. Power derives from the AUI port, 
making an external source unneces¬ 
sary. The 10 Base T port uses an RJ45 
connector. Telebyte Technology; $119, 
discounts for quantities. 
Chips and components


3V CMOS EPROMs 

The 27LV256 and the 27LV512 are 
3V CMOS EPROMs designed for bat¬ 
tery-powered applications in which 5V 
devices are no longer preferred. The
manufacturer says the 3V EPROMs provide
200-ns access times.

The 27LV256 and 27LV512 are avail¬ 
able in speeds of 200, 250, and 300 ns. 
Both devices come in plastic DIP, PLCC,
and SOIC packages. Microchip Technology;
27LV256 from $4.73, 27LV512
from $8.03 (1,000s). 
Hardware-based fuzzy-logic
controller 

The NLX230 is a single-chip fuzzy- 
logic inference engine with on-chip rule 
memory. The manufacturer says it pro¬ 
cesses 30 to 40 times faster than com¬ 
parable software-based or hardware/ 
software hybrid fuzzy-logic control so¬ 
lutions. The manufacturer recommends 
it as a replacement for conventional 
PID controllers, sequencers, state ma¬ 
chines, and intelligent timers. 

The controller is housed in a 40-pin 
package. It supports eight inputs, eight 
outputs, and a resident base of 64 
fuzzy-logic rules. The manufacturer also 
offers an applications development 
system that includes the chip and 
associated circuitry on a PC-compat¬ 
ible board, controlling software, and 
documentation. NeuraLogix; $4 (production
quantities), $395 (development
system). 
From the
Editor-in-chief 



One year later 


Now that I’ve been IEEE
Micro's EIC for one year, it
is time to check the balance 
sheets; in my introduction of 
February 1991 I expressed 
some of my “good inten¬ 
tions.” The last two words in 
the first paragraph in that 
introduction (I’m sure you 
keep old issues, so you will 
know what they were) were 
surely true. 

In this year some things 
have been accomplished, 
while others still must be 
done. I am sure that IEEE Micro has been a good 
channel to bring a lot of useful information to 
you readers. I myself learned a lot of things. First 
of all I learned how important the work of the 
managing editor and staff is to the magazine. I 
understand now that a magazine could survive 
without an EIC but not without the staff. 

I also learned that Europeans can directly com¬ 
municate with people in the US only in a nar¬ 
row time window (and that everybody in the US 
makes use of an answering machine!). Some big 
organization should start a project to unify the
time on a world basis. (It would be quite easy: 
You modify the sun into a ring-shaped variable 
star with the earth in the center.... Sometimes I
think that such a project could be simpler than 
trying to make a magazine exactly as we think it 
should be.) 

Some members of the editorial board com¬ 
pleted their terms—we must thank them for their 
contributions—and others joined the board. Our 
editorial board is growing with capable people 
from academia and industry, while it keeps the 
international balance between the US, Europe,
and the Far East.

A magazine results from the combined efforts 
of authors, editors, staff, and readers. The first 
three groups of people depend on the last— 
you, the readers—to continue to support the 
magazine by providing proposals, comments, 
and suggestions. As in the past, we promise to 
always consider these comments. 

What changed in the magazine during 1991? 
Costs were reduced (credit to our staff), allow¬ 
ing us to bring you the most pages possible in 
spite of inflation and general economy problems. 

We have also devoted efforts to reducing the 
amount of time spent in reviewing manuscripts. 
The shorter review cycle serves the needs of 
authors (their work reaches a wide audience in 
a shorter time, or at least they know their fate 
sooner) and readers (they receive updated
information quickly). The theme issues in 1991
brought coordinated sets of articles on some of 
the hottest areas in microelectronics and 
microsystems. This will continue in 1992 and 
1993. (See p. 41 for clues.)

The current issue is a “general” one, so you 
find an assortment of articles that address various
aspects of high-performance computing. The
first, “The Scalable Coherent Interface and Re¬ 
lated Standards Projects” by Dave Gustavson, 
provides an insight into an IEEE project (P1596-SCI).
SCI proposes new solutions for the communication
bottleneck intrinsic in bus-based
multiprocessor architectures. Instead of wider and 
faster buses (which also means more expensive 
and power-hungry buses), we can switch to 
nonbused interconnection schemes and still keep 
the configurability we generally find in bused 
systems. SCI is actually a set of projects address¬ 
ing the many aspects of an interconnection struc¬ 
ture, from the lowest level (the physical 
communication channel) to higher protocol
layers dealing with cache coherency.
It ensures tight coordination with
other standards projects. 

Next Daniel Mann’s “Unix and the 
Am29000 Microprocessor” article shows 
how the hardware features of a RISC 
processor can be exploited to optimize 
its performance for a given operating 
system. 

With the third article we switch to 
completely different architectures. 
“Hardware Requirements for Neural 
Network Pattern Classifiers: A Case 
Study and Implementation” by Boser 
et al. describes in detail an ASINC, or 
application-specific integrated neural 
circuit. This is so new an acronym that 
it does not yet appear in Carl Warren’s 
glossary (see Micro Standards begin¬ 
ning on p. 69). 

Efficient and affordable handwritten 
character recognition can open the way 
to a variety of new industrial applica¬ 
tions. I am convinced that we will see 
such features within the sensor itself 
in a few years (the same way keyboards 
now have embedded controllers that 
send ASCII characters directly to the 
host microprocessor). There is a good 
chance that such intelligent sensors will 
use the neural approach, so take a close 
look at this article to get an idea of 
what will be on the market in the near 
future. 

A common denominator for these 
three articles is that each of them 
achieves high performance, thanks to 
some specific silicon basis (protocol 
chips, RISC, neural hardware). The next 
article, “Experimentation with Hyper¬ 
cube Database Engines” by Frieder et 
al., describes efficient algorithms for 
handling distributed shared data in a 
specific architecture. 

Even the best hardware and the best 
software can provide good results only 
if they can work correctly. In a modu¬ 
lar system, this means each module 
must comply with the specification of 
the common interface (the bus). The 
last article in this issue, “Conformance 
Testing of VMEbus and Multibus II
Products” by Adams et al., describes a
technique used to verify compliance 
of boards to a given bus specification. 






In the mailbag 


(LK: liked; DLK: disliked; LTS: like to
see) 

In this mailbag the total number of 
readers who did not like split articles 
amounts to nine. Their comments are 
not individually reported here; I think 
they already received their answer in 
December. Thank you all for the ad¬ 
vice.—D.D.C. 

February 1991 

LK: Hardware design; LTS: network 
systems.—S.T., Moscow 

April 1991 

LK: Optical architecture: excellent 
article, though I think there were
some errors in certain figures and not 
enough clarity in parts of the text. 
(Please be more precise, so we can 
inform the author and ask for correc¬ 
tions if required.—D.D.C.) LTS: RISC 
architectures.—J.C.A.A., Tlalnepantla,
Mexico 

LK: Software and hardware com¬ 
puter science and its electronics; LTS: 
... U.P.S. units for computers.— 
M.M.M., Alexandria, Egypt 

LK: Letters, Micro News, Micro 
View, and other chapters; LTS: a chap¬ 
ter [identifying] computer idioms.— 
M.G., Isfahan, Iran (An acronym 
dictionary appears in this issue.— 
D.D.C.) 

June 1991 

LK: Richard Stern, Book Review— 
T.S., Tromso, Norway

LK: On The Edge; it has focused 
very well on the problem. Moreover, 
author used simple language to de¬ 
scribe it... very important... to teach 
something. I'd like to see more ar¬ 
ticles like this.—C.G., Verona, Italy 


(These compliments are for Carl War¬ 
ren and James Gafford, who were 
responsible for this column.—D.D.C.) 

LK: The whole issue; DLK: “Light 
at the end of the chunnel.” It has 
nothing to do with the hot chips 
[theme]. (You are correct: Depart¬ 
ments are not necessarily linked with 
special themes.—D.D.C.) LTS: Hot 
Chips III—T.P., Warsaw (You will get 
it in the April issue.—D.D.C.) 

LK: The overview of the BTRON/ 
286 specs.—E.D., Aalen, Germany 
LK: Enjoyed reading [about] iWarp 
and Datawave. How much does the 
iWarp and ITT Datawave chip cost? 
(The task of the designer—who writes
the articles—is to minimize the manu¬ 
facturing cost, but they usually have 
little control on final pricing. It is bet¬ 
ter to ask a commercial representa¬ 
tive.—D.D.C.) DLK: the guest editors' 
introduction of what is superscalar 
and pipelining; LTS: How supersca¬ 
lar and pipelining come into play. Do 
all chips have one or could they have 
both? This was not clear.—A.R.F., San 
Ramon, CA (Almost all microproces¬ 
sors announced in the last five years 
are pipelined; only a few recent ones 
are also superscalar.—M.D.H./ 
D.A.W., guest editors.) 

DLK: Rate monotonic scheduling, 
too little tutorial information.—P.F., 
Adliswil, Switzerland 

LK: iWarp: A 100 MOPS....—S.R.E, 
Saudi Arabia 

August 1991 

LK: Micro Law; I never miss it.— 
T.D.L., Cupertino, CA 

LTS: More practical software reports 
for science and engineering, Unix.— 
F.M.R., Munich 
Richard H. Stern
Maier & Neustadt, P. C.

1755 Jefferson Davis
Highway
Arlington, VA 22202


Engineers can be disqualified, too. 


During the heyday of the mergers-and-
acquisitions frenzy, the disqualification 
ploy became a common litigation tactic. 
The first thing a party would do in a lawsuit 
would be to file a motion that the other side’s 
lawyer should be disqualified from representing 
his or her client, because at some time or an¬ 
other counsel had performed some kind of legal 
representation of the adversary party who made 
the motion. 

The rationale for filing such motions, and for 
the court when it went along with the move, 
was to prevent the appearance of impropriety. 
Never mind that the earlier legal representation 
had nothing to do with the present case. The 
public’s great confidence in the legal profession 
might be impaired if it even appeared that cli¬ 
ents might not be able to rely on the confidenti¬ 
ality of their relationship with their attorneys. How 
could clients feel safe hiring lawyers and telling 
them their secrets if the very same secrets would 
later be used against them to help an adversary? 

After a while, the courts became as cynical 
about these motions as those who filed them 
and recognized them as another ploy to put the 
other party at a tactical disadvantage. They be¬ 
came skeptical about arguments based on the 
body blow to our social order wrought by the 
appearance of impropriety and began to limit 
disqualification to situations where a party’s ad¬ 
versary would actually get the advantage of rel¬ 
evant business secrets disclosed in confidence 
in an earlier case. Such things as knowing the 
structure of the client’s hierarchy of values, the 
client’s aversion to or enthusiasm for risk, how 
the client thinks about things in general, or the 
client’s negotiating strategies, all went out the 
window as bases for disqualification. 

The rule now seems to be pretty close to this: 


The lawyer must really know (or be likely to 
know) where the relevant bodies are buried. For 
example, the court may disqualify counsel if the 
lawyer 

• wrote the patent application on the inven¬ 
tion that the lawyer now wants to assert 
(on behalf of an adversary of the inventor) 
is invalid; 

• is going to be a witness, because he or she 
wrote the patent application that suppos¬ 
edly was used to defraud the patent office; 

• represented the patent owner in another 
case involving a related patent, where coun¬ 
sel learned about the weaknesses of the 
client’s claim on this technology; or 

• represented one party to a license in work¬ 
ing out and drafting the agreement, where 
parties are now in a dispute over what the 
agreement was intended to do (what it 
means), and the lawyer wants to represent 
the other party in the dispute (or maybe 
now the lawyer plans to claim that the origi¬ 
nal license agreement is legally invalid). 

While things are pretty well sorted out on that 
front, a new frontier is opening up, which may 
be of greater concern to electrical engineers and 
computer science professionals. It now appears 
that they, too, can be disqualified. 

The Balde case. In a decision handed down 
last summer, Wang Laboratories, Inc. v. Toshiba 
Corp., a federal court in Alexandria, Virginia
(see box), disqualified a computer consultant from
representing NEC (a codefendant of Toshiba) 
because he had previously given Wang a pre¬ 
liminary opinion that its patent was invalid. In 
November 1990, Wang’s attorney had telephoned 
John Balde, a computer consultant. He asked 
Balde if he was familiar with SIMM
(single in-line memory module) tech¬ 
nology, and Balde said he was. The 
parties differ over what happened next. 
According to Wang, the lawyer retained 
Balde and agreed to pay him for his 
time. According to Balde, he was not 
retained, and he told the lawyer he 
would have to determine whether 
Wang's SIMM patents were valid be¬ 
fore he would enter any agreement 
with the company. 

Wang’s lawyer then sent Balde a let¬ 
ter (dated November 14) transmitting 
various documents: the SIMM patents, 
some prior art publications, some ma¬ 
terials concerning patent infringement, 
and a long memorandum written by 
Wang’s attorney about the history of 
the prosecution of the patents before 
the patent office. He asked Balde to 
review the material, “so that we can 
discuss how best to explain the advan¬ 
tages to a computer designer of using” 
SIMM, and he suggested they meet af¬ 
ter Balde reviewed the material. 

The next day—watch out for this 
one, readers—Wang’s lawyer sent a 
second letter. Unlike the first one, it 
was prominently labeled, “Confiden¬ 
tial Attorney-Work Product.” This let¬ 
ter (dated November 15) contained an 
outline of potential legal defenses 
against Wang’s suit, as Wang’s lawyer 
perceived them, and asked Balde to 
provide his opinion on various issues. 
It also said that this material “will assist 
your review of the material included 
in my November 14, 1990, letter.” 

Several telephone conferences fol¬ 
lowed. The lawyer later said that in 
them he disclosed Wang’s confidential 
information to Balde, and that he made 
the confidentiality clear to Balde. Balde 
did not deny this, but said that he made 
no use of the material in the Novem¬ 
ber 15 letter. He considered the letter 
premature because it asked him to give 
opinions on specific litigation issues. 
But he did not want to give any opin¬ 
ions on the issues until he had made 
up his mind about the validity of the 
SIMM patents. If they were not valid, 


Alexandria's 
"rocket docket" 

Last summer, the US District 
Court for the Eastern District of 
Virginia awarded Wang $3.3 million 
in damages against Toshiba 
and NEC for patent infringement. 
The case is of special interest be¬ 
cause it comes from the “rocket 
docket” of Alexandria, Virginia. 
This court is becoming a forum of 
choice for patent infringement liti¬ 
gation and other complicated 
cases that usually drag on for 
many years, because of its firm 
policy that all cases must be tried 
within six months after joining is¬ 
sue. Presumably, one can expect 
the same kind of rulings about 
experts in other cases brought in 
that court. 


he did not want to become involved 
with Wang. Balde felt that Wang’s law¬ 
yer was jumping the gun on enlisting 
Balde in Wang’s camp before he felt 
ready to sign up. Nevertheless, as the 
judge pointed out in his opinion, Balde 
did not write back to Wang’s lawyer 
saying any of this. 

Balde studied the matter and con¬ 
cluded that the SIMM patents were in¬ 
valid. On December 10 he telephoned 
Wang’s lawyer, told him his conclu¬ 
sion, and said that he therefore did not 
want to act as a consultant for Wang in 
the case. The lawyer asked Balde for a 
report, which Balde sent two days later. 
In his cover letter for the report, Balde 
wrote, “As you know, I have read ... 
the Work-Product information on the 
two Wang SIMM patents,” and he 
thanked Wang for offering to pay his 
$1,500 invoice for time spent on the 
task. Subsequently, NEC retained Balde 
as its technical expert in the case. Wang 
then moved to disqualify Balde. 

Court’s decision. The court found 
little precedent about disqualification 
of experts, but held that it had an in¬ 


herent power to disqualify them in ap¬ 
propriate cases. This power exists to 
help the court fulfill its judicial duty to 
protect the integrity of the adversary 
legal process and promote public con¬ 
fidence in the fairness and integrity of 
the legal process. 

The court found two issues: 1) Was 
it reasonable for Wang to think it had 
a confidential relationship with Balde? 
2) Did Wang disclose confidential in¬ 
formation to Balde? If and only if both 
answers are affirmative, Balde should 
be disqualified. It also recognized that 
lawyers might seek to disable poten¬ 
tially troublesome experts merely by 
retaining them without using them, and 
said that it would not countenance that 
ploy. (Here, Balde had been NEC’s 
consultant in the past.) On the other 
hand, consultants who do not want to 
be bound by a duty of confidentiality 
should make it clear that they do not 
want to be retained until they make 
up their minds, and in the meantime 
they should not accept confidential 
disclosures. 

In this case, the November 15 letter 
carried the day for Wang. The letter 
was captioned “confidential attorney- 
work product,” and an examination of 
the enclosures confirms that descrip¬ 
tion. The court said, “No experienced 
litigator would freely disclose these ma¬ 
terials to opposing counsel.” Balde’s si¬ 
lence in the face of the November 15 
letter established that Wang’s lawyer 
was reasonable in thinking that a con¬ 
fidential relationship existed, and in fact 
that letter transmitted confidential in¬ 
formation to Balde. 

Advice. Neither Wang nor Balde had 
“acted inappropriately,” the court 
found. Even so, it decided to offer some 
free advice for others “to avoid a rep¬ 
etition of these unfortunate circum¬ 
stances.” First, lawyers should make it 
clear and confirm in writing that a con¬ 
fidential relationship will be created. 
Preferably, this should be specifically 
explained in a letter, along with con¬ 
firmation of payment terms and condi¬ 
tions. These elements were missing in 
this case. Nevertheless, the court 
disqualified Balde. 

Next, consultants should take care 
to avoid creating confusion about their 
position. Doubts about wanting to be 
retained should be unequivocally ex¬ 
pressed. Here, the court said, “given 
his stated [read that as ‘alleged’] con¬ 
cerns” about patent invalidity, Balde 
needed no more than the identity of 
the parties and the patent numbers. 
(The last is a bit of judicial overkill. 
Since the patents are public documents, 
their text is no secret. It would be an 
inconvenience to make Balde get his 
own copies.) He should have declined 
to accept anything more, the court said. 

Counsel should ask the consultant 
whether his past employment creates 
any problem. Since NEC apparently did 
not make this inquiry of Balde before 
retaining him for the case against Wang, 
the court felt that NEC had only itself 
to thank for its problem of having no 
expert on the eve of trial. (Presumably, 
Wang, too, should have asked Balde 
whether his past work for NEC gave 
him any of NEC’s secrets, which could 
make it improper for him to be Wang’s 
consultant.) 

Also, NEC should have promptly 
advised Wang and discussed the mat¬ 
ter “thoroughly in an effort to resolve 
the dispute before it is raised in court.” 
(To which I say, lots of luck. Why 
would Wang agree to roll over on any¬ 
thing? Excess of gentlemanliness? When 
was the last time you heard of noblesse 
oblige being a significant factor in how 
litigators behave?) 

To be sure, the court said, experts 
“are not advocates; they are sources of 
information and opinions in technical 
[matters] ... Yet, when experts are re¬ 
tained in connection with litigation, 
they must operate within the constraints 
of, and consistent with, the adversary 
process.” 

I am not going to express any opin¬ 
ion on whether Wang’s lawyer sand¬ 
bagged, hoodwinked, or otherwise 
mistreated Balde and NEC, nor on 
whether Balde got a deserved come¬ 


uppance. The court apparently felt that 
none of that happened, which is, of 
course, quite good enough for me. But 
the lawyer certainly outmaneuvered the 
EE in this case. 

Protecting yourself. So where does 
all of this leave other EEs? What do 
you do to “avoid a repetition of these 
unfortunate circumstances”? Or, exer¬ 
cising 20-20 hindsight, what should 
Balde have done? What would the CYA 
Manual for Electronic Engineer and 
Computer Science Would-be Expert 
Witnesses have prescribed, if anything? 
Remember, Balde was not trying to 
scare Wang away. He didn’t know 
whether NEC would want to retain him 
in this case, and he had to pay the rent 
every month. He also did not know, 
presumably, how he felt about the 
SIMM patents and Wang’s case. 

... consultants ... 
could be 
"Balde-ized." 

Ideally, Balde would have had his 
own lawyer, whom he would have 
consulted immediately after the No¬ 
vember 14 telephone call. He should 
have turned the November 14 and 15 
letters over to a lawyer, unread and 
preferably unopened. Balde’s lawyer 
would have responded for him or ad¬ 
vised him how to respond. He would 
have matched Wang’s lawyer’s efforts 
with the same thing in reverse, some¬ 
thing like the following: 

F, U, & D, Counselors-at-Law 
November 16, 1990 

Dear Sir: 

My client, Double-E Balde, has 
turned over to me your letters of No¬ 
vember 14 and 15 and enclosures, 
which he has not read. My client wishes 
you to be advised that he is not yet fully 


prepared, at this time, to enter into a 
relationship of confidentiality with re¬ 
gard to the subject matter of Wang v. 
Toshiba and NEC. In all fairness to 
Wang, my client feels that he should first 
make his own preliminary evaluation 
of materials already of public record, 
or otherwise not secret or confidential, 
in order to satisfy himself that he can 
appropriately consult with, or act as an 
expert for, Wang in this matter. I am 
sure you will appreciate that it would 
not be in Wang's or his interests to have 
him prematurely enter into a confidential 
relationship with Wang, in the event 
that it later appears he would be obliged 
to testify as to opinions inconsistent with 
Wang's position in this matter. 

We will get back to you as soon as 
Mr. Balde has finished the foregoing 
preliminary evaluation. In the interim, 
I will retain the documents, unless you 
wish me to return them to you at this 
time. 

Sincerely yours, 

F, U, & D 

Of course, you may feel it is not fea¬ 
sible to run a consulting business this 
way. What then? You consultants who 
don’t want to run your business 
through lawyers still must do some¬ 
thing, or you could be “Balde-ized.” At 
the very least, you should send a let¬ 
ter, such as the following, that nega¬ 
tives a confidential client relationship 
like the one set up for Balde by the 
lawyer’s letter. 

Bit Bucket Consulting Services, Ltd. 

November 16, 1990 

Dear Counselor: 

Thank you for your letters of November 
14 and 15, 1990, regarding the pending 
patent infringement suit between Wang 
and Toshiba/NEC regarding SIMM 
technology. I am responding 
to both letters at the same time, because 
your second letter arrived before I could 
continued on p. 72 
R&D in Japan 


These two reports and two announcements 
reflect some of the ongoing research in 
this country. 

Forecast for 2010 

Japan’s Economic Planning Agency enlisted the 
assistance of a group of 10 experts to assess the 
country’s position in technology and its direction 
for the next 20 years. After listing 101 technologi¬ 
cal items, the study group developed a question¬ 
naire concerning them, combined the responses, 
and wrote a lengthy report and summary (in Japa¬ 
nese). The summary, released in July 1991, 
proved fascinating in the detailed views of the 
group members but produced some problems for 
readers. For example, the study may not have 
much statistical validity, given that one or at most 
two experts assessed the individual items. In ad¬ 
dition, parts of the textual material were awk¬ 
wardly phrased. 

But to me, the most interesting portions of the 
report appear in its tables. Table 1 lists selected 
items from the report’s tables. 

Micromachines 

Under Japan’s National Research and Devel¬ 
opment Program (popularly known as the Large- 
Scale Project), industry, government, and 
academic circles cooperate on research and de¬ 
velopment of innovative, advanced, large-scale 
industrial technologies deemed important and 
urgent for the national economy. Since the pro¬ 
gram began in 1955, 29 projects have been 
launched, eight of which continue today. MITI 
(Japan’s Ministry of International Trade and In¬ 
dustry) proposed an R&D program called 
Micromachine Technology to begin in fiscal year 
1991. 

The New Energy and Industrial Technology 
Development Organization (NEDO) under the 
authority of the Agency of Industrial Science and 
Technology will conduct the Micromachine Tech¬ 
nology project. NEDO plans to establish the tech¬ 
nologies necessary for the realization of 
micromachines. 

Elemental technologies. The first step of the 
program (covering the first four to five years) 
emphasizes the establishment of basic technolo¬ 
gies of elemental components for micromachines 
and targets four major R&D items. The first major 
item in this phase will establish technologies for 
material processing and control of dynamic 
mechanisms by developing the elements of a 
micromachine. The elements are actuators (ther¬ 
mal-, electrical-, or magnetic-induced deforma¬ 
tion, electrostatic, hydraulic) and sensors 
(electromagnetic, chemical). 

A second item calls for the development of 
technologies that transform energy from external 
sources, micro internal batteries, or micro power 
generators. A third will investigate micromachine 
communication with the exterior world and con¬ 
duct theoretical research and associated software 
development of remote control and coordinated 
distributed control. 

A final major item concerns the measurement 
of accuracy and movement of the micromachines 
for evaluation of the results of the R&D program. 

R&D description. Though micromachines are 
small, they are complex systems with advanced 
functions. Some of the areas that need to be re¬ 
searched include the basic theories that underpin 
miniaturization methods, structural analysis, ma¬ 
terials, and component technologies of micro¬ 
scopic processing and assembly. Other areas 
include the techniques for producing microscopic 
sensors and control circuits, and the system tech¬ 
nology required to perform microscopic motion 


February 1992 7 














Table 1. Identified technologies and application year. 

Technology                              Year 

Information electronics 
  Biosensor                             2000 
  Superparallel computer                2010 
  Terabit optoelectronic file           2010 
  Superintendent chip                   2010 
  Terabit optocommunication device      2010 
  Automatic translation systems         2020 
  Virtual reality system                2020 
  Self-replicating database system      2020 
  Neurocomputer                         2030 
  Terabit memory                        2030 
  Self-replicating chip                 2050 

New materials 
  Ceramics gas turbine                  2000 
  Magnetic materials                    2010 
  Hydrogen occlusion alloy              2010 
  Optical IC                            2010 
  Superlattice devices                  2010 
  Nonlinear optoelectronics             2020 
  Superconductors                       2030 

Automation 
  Micromachines                         2010 
  Concurrent engineering                2010 
  Intelligent CAD                       2020 

Communication 
  TV conference system                  1994 
  TV telephone                          1994 
  Optoelectronic LAN                    1995 
  Broadband ISDN switches               1995 
  HDTV                                  1995 


and operation. Table 2 lists some of the 
main topics envisioned for micromachine 
technology. 

The technology. Microscopic machines 
or instruments with advanced 
functions can perform minute tasks or 
work in extremely narrow spaces. Their 
small size allows them to be applied to 
a wide variety of areas, including medicine, 
biotechnology, and industry. 

Recently, silicon micromachining 
technology, in which micrometer mechanical 
structures are formed on silicon 
wafers, opened up this field. The 
technology emerged from etching, 
deposition, and other lithographic techniques 
for microprocessing silicon developed 
in the 1970s and enabled the 
production of cantilevers, diaphragms, 
and other simple mechanical components. 
These products are now used 
widely as pressure sensors or are beginning 
to be commercialized as acceleration 
and flow sensors. Moreover, recent 
advances in semiconductor 
microprocessing and ultraprecision 
processing technology allow the creation 
of mechanical parts far smaller 
than anything previously developed. 

Researchers only recently began to 
investigate micromachine technology, 
however, and need to overcome many 
technological barriers before such ma¬ 
chines can be developed. Some barriers 


are those related to friction, durability, 
strength, materials, and power sources 
and supplies. Researchers also must de¬ 
velop a number of other technologies, 
including ways to design, process, as¬ 
semble, and control micromachines. 

Micromachines today. We can ap¬ 
proach micromachine development in 
two ways. One approach uses technol¬ 
ogy in the field of mechanical engineer¬ 
ing, which makes existing mechanisms 
even smaller; the other uses Micro 
Electro Mechanical Systems (MEMS) 
technology, which uses IC production 
technology. Many researchers have 
proposed or built prototypes of various 
microactuators and microstructures that 
could serve as the component tech¬ 
nologies for micromachines. 

In the United States, schools like 
MIT, Stanford, and the universities of 
Wisconsin-Madison, Michigan, and 
California, Berkeley continue to 
research surface micromachining 
technology and LIGA (Lithografie, 
Galvanoformung, Abformung) process 
technology in their silicon and LIGA 
process research centers. 

MIT researchers produced a 
100-µm-diameter polysilicon micromotor 
rotating at 15,000 rpm and analyzed its 
movement. Researchers created the 
motor using the same process used for 
IC production. Wisconsin-Madison 
researchers produced 3D structures 
containing gears with 55-µm inside 
diameters. Researchers at Berkeley 
produced a 120-µm-diameter electrostatic 
motor and verified that it does rotate. 
They can also measure friction coeffi¬ 
cients, one of the most difficult prob¬ 
lems in micromachine technology, 
using a micro electrostatic linear 
actuator. 

The US National Science Foundation 
supports these efforts. In 1988 NSF dis¬ 
tributed its micromachine research 
budget among eight universities and 
in 1989 provided funding to 11 
universities. 

Europe also supports several 
micromachining facilities, including 
Germany’s Fraunhofer Institut, Technische 
Universität Berlin, Kernforschungszentrum 
Karlsruhe Institut für Mikrostrukturtechnik, 
and the Netherlands' 
University of Twente. Research 
concentrates mostly on sensors. The 
Fraunhofer Institut für Mikrostrukturtechnik 
produced prototypes of a 
vibration sensor with 32 cantilever 
mechanical resonators and a 1.5-mm x 
1.25-mm cantilever thermal bimorphic 
microactuator. These facilities receive 
subsidies both from the European 
Economic Community and from their 
respective countries. 

Japan pursues numerous creative 
studies related to micromachine 
technology. For example, the University of 
Tokyo developed prototypes of a micro 
Stirling engine with high thermal 
efficiency as a microactuator, and Tohoku 
University produced a microvalve using 
silicon. NTT Applied Electronics 
Laboratories produced a 500-nm x 500-nm, 
active integrated optical microencoder 
with 0.01-µm resolution. 

Research at many universities, national 
research institutes, and private 
companies continues to produce prototypes. 
These include electrostatic linear 
actuators, micropressure sensors, micro 
IS-FET (ion-sensitive field-effect transistor) 
sensors, micromanipulators using 
piezoelectric-impact drive systems, micro 
active catheters using shape 
memory alloy (SMA), and more. 


Table 2. Main topics of micromachine research. 

Technologies                      Description 

Microscopic mechanical device     R&D on structures, materials, machining 
                                  techniques, integration techniques, 
                                  and power supplies for microscopic 
                                  mechanisms and functional 
                                  components required for 
                                  micromachines 

                                  Development of technology to enable 
                                  the production of various mechanical 
                                  devices 

Microscopic sensors, control      R&D in the technologies needed to 
circuits, and other techniques    produce extremely miniaturized 
for miniaturized electronic       electronic devices such as microscopic 
devices                           sensors and control circuits used in 
                                  micromachines 

Control and operation             R&D in motion control and operation 
                                  technologies for microscopic 
                                  mechanisms 

Measurement and evaluation        Basic research into measurement 
                                  methods, evaluation methods, and 
                                  microscopic measurement technology 
                                  as they relate to various component 
                                  devices 

Support                           Basic research into support technologies 
                                  including lubrication techniques for 
                                  microscopic parts, theoretical 
                                  simulation, and CAD/CAM 

System integration                R&D project extending 10 years 
                                  (FY1991-FY2000) and costing 
                                  approximately 25 billion yen 


These technologies, which mainly 
use semiconductor production tech¬ 
niques, represent only a small part of 
micromachine production technology. 
Research in this field is still in its in¬ 
fancy, and many problems remain to be 
solved. Some of the hurdles to be over¬ 
come include the development of pro¬ 
duction and process technologies 
geared specifically toward micro¬ 
machines and solving questions related 
to friction, durability, strength, materi¬ 
als, power supplies, and control. 

Application of results. Since 
micromachine technology will have a 
variety of applications, the program will 
focus on common component tech¬ 
nologies. Once developed these areas 
probably will result eventually in appli¬ 
cations like the following. 

Industrial micromachines. Industry 
faces the need to boost reliability and 
reduce maintenance costs for ever 
more-advanced and complex mechani¬ 
cal systems and equipment (power 
plants and airplane engines are two 
good examples). A tremendous need 
exists for technology that makes it pos¬ 
sible to perform inspections and repairs 
in extremely tight spaces without hav¬ 
ing to dismantle the entire system or 
equipment in question, such as plant 
pipe systems and airplane engines. 

Industrial micromachines will enable 
inspection and repair without requiring 
that plant equipment be dismantled. 
This capability will make it possible to 
perform early inspection and repair to 
minimize the extent of damage. We can 
therefore expect significant improve¬ 
ments in capacity utilization and main¬ 
tenance costs for electric power plants 
and other facilities. 

Medical micromachines. Today’s 
medical procedures do not sufficiently 
alleviate the pain experienced during 
diagnosis and treatment. In addition, as 
the population ages, we will see a 
strong demand for advanced medical 
equipment that lessens the physical and 
mental stress inflicted on patients. 

continued on p. 87 
The Scalable Coherent Interface and 
Related Standards Projects 


The Scalable Coherent Interface (IEEE P1596) provides bus services by transmitting packets on 
a collection of point-to-point unidirectional links. Its protocols support cache coherence in a 
distributed shared-memory multiprocessor model, with message passing, I/O, and LAN com¬ 
munication taking place over fiber optic or wire links. Several ongoing SCI-related projects 
apply the SCI technology to new areas or extend it to more difficult problems. 


David B. Gustavson 

Stanford Linear Accelerator 
Center 



The Scalable Coherent Interface (SCI) 
was developed by a number of bus 
designers and system architects who 
had come to understand the funda¬ 
mental limits to bus technology during their work 
on Fastbus (IEEE 960) and Futurebus+ (IEEE 
896.x). These modern buses push bus signaling 
technology to its limits and provide various ar¬ 
chitectural features that support the use of mul¬ 
tiple processors. 

These bus limits are rapidly becoming a seri¬ 
ous problem as the demand for computing power 
continues to grow. The economic reality is that 
we can only meet this demand by using a large 
number of fast microprocessors. Buses, however, 
are inherently a bottleneck (only one transfer at 
a time), and their signaling speed is limited by 
the imperfect transmission lines that result from 
bus-style connections. 

Therefore, buses can’t support a large number 
of processors, especially not fast ones. While we 
can extend their useful life a bit by cleverness 
and brute force, the potential gains are relatively 
small and the costs become very high. For ex¬ 
ample, doubling the width of a bus does not 
double its speed because there are fixed over¬ 
heads associated with arbitration and addressing. 
Lengthening block transfers to reduce the effect 
of these overheads is of little use once the blocks 
exceed the size of cache lines. 
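
The width argument can be checked with a little arithmetic. The sketch below is only an illustration: the 50-ns cycle, the four overhead cycles for arbitration and addressing, and the 64-byte cache line are assumed numbers, not figures for any particular bus.

```python
def bus_throughput(width_bytes, block_bytes, cycle_ns, overhead_cycles):
    """Effective throughput in Mbytes/s for one block transfer on a bus.

    overhead_cycles stands for the fixed arbitration and addressing cost
    paid once per transfer (a hypothetical value, chosen for illustration).
    """
    data_cycles = -(-block_bytes // width_bytes)  # ceiling division
    total_ns = (overhead_cycles + data_cycles) * cycle_ns
    return block_bytes / total_ns * 1000.0  # bytes/ns -> Mbytes/s


# Assumed numbers: 64-byte cache line, 50-ns bus cycle, 4 overhead cycles.
narrow = bus_throughput(width_bytes=8, block_bytes=64, cycle_ns=50, overhead_cycles=4)
wide = bus_throughput(width_bytes=16, block_bytes=64, cycle_ns=50, overhead_cycles=4)
speedup = wide / narrow

print(f"8-byte bus:  {narrow:.1f} Mbytes/s")
print(f"16-byte bus: {wide:.1f} Mbytes/s")
print(f"speedup from doubling the width: {speedup:.2f}x")
```

With these assumptions the wider bus moves the same cache line in 8 cycles instead of 12, so the gain is 1.5 times rather than 2. The fixed overhead cycles are untouched by the extra width.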


We can increase signaling speeds by shorten¬ 
ing the bus, but that makes it less useful. Reduc¬ 
ing the signal voltage helps, but eventually this 
solution experiences noise problems. Using mul¬ 
tiple buses to achieve more than one transfer at a 
time results in a complex (expensive) bus-bridge 
mechanism to maintain cache consistency (co¬ 
herence) in shared-memory systems that use bus- 
snooping technology. 

Paul Sweazey (Futurebus cache-coherence task 
group coordinator) started the Superbus Study 
Group in November of 1987 to see if there were 
potential solutions to these problems. In July 1988 
the outline of the solutions had become clear, 
and the P1596 SCI working group replaced the 
study group. The work was essentially completed 
by January 1991, when specification draft D1.00 
went out for ballot. Since then, it has been un¬ 
dergoing minor improvements, polishing, and 
debugging of the specification C code. (Most of 
the SCI specification is executable, to reduce 
ambiguity, simplify testing, and enable accurate 
simulation.) 

The resulting draft D2.00 1 recirculated to the 
balloting body in December. Since the draft passed 
with 92 percent affirmative, and all but one 
objection has been resolved (we refused to change 
the C code to Pascal), final approval by the IEEE 
Standards Board seems probable in March 1992, 
unless new objections arise. 
SCI goals 

The SCI design goals include 

• Scalability, so that the same mechanisms can be used in 
high-volume, single-processor (or few-processor) sys¬ 
tems such as one might find in desktop machines, as 
well as in large, highly parallel multiprocessors (next- 
generation supercomputers); 

• Coherence, to support the efficient use of cache memories 
in the most general and easiest-to-use multiprocessor 
model, distributed shared memory; and 

• An interface, a standardized open communication architecture 
that allows products from multiple vendors to 
be incorporated into one system and interoperate 
smoothly. 

Scalability keeps costs down, not only through increased 
volume of production but through the simplicity of having to 
learn only one new paradigm—one that will work over sev¬ 
eral generations of machines. (See box.) 


SCI applications 


SCI uses point-to-point links to achieve very high speed 
communication. For the highest performance over short 
distances (typically within a cabinet), 16-bit-wide links run 
at 1,000 Mbytes/s. For I/O applications within a room, serial 
coaxial cable links run at 1,000 Mbps. For I/O over cam¬ 
pus distances of a few kilometers, optical fibers can carry 
the same serial bit stream. See Figure A. 

SCI's scalable architecture allows the same protocols 


to cover the range from internal communication within a 
multiprocessing supercomputer to local area network ap¬ 
plications. LAN communications look like moving data 
from one address to another in memory, a very simple 
software model. However, wide-area networks need hi¬ 
erarchical addressing instead of SCI’s flat 64-bit address 
model, so the usual software protocol translations are 
necessary. 





A standard SCI module defines signals, connector, and 
power for operation at a link speed of 1,000 Mbytes/s. A 
standard cable and connector defines the use of these signals 
for applications in which the module form factor is not ap¬ 
propriate, over short distances (meters). A standard fiber¬ 
optic serial interface solves interface problems over longer 
distances (kilometers) at a link speed of 1,000 Mbits per sec¬ 
ond. The same bit stream may be sent over coaxial cable at 
lower cost for medium distances (tens of meters). We plan to 
standardize other speeds and signaling in the future as 
appropriate. 
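
To put the two link speeds side by side, the following sketch computes how long a block of data occupies each kind of link. It assumes that every bit on the link carries payload and uses a 64-byte block as an example; packet headers and encoding overhead are ignored, so treat the results as rough lower bounds rather than specification values.

```python
def transfer_time_ns(payload_bytes, link_bits_per_second):
    """Time for the payload to pass a point on the link, in nanoseconds."""
    return payload_bytes * 8 / link_bits_per_second * 1e9


PARALLEL_BPS = 1000e6 * 8  # 1,000 Mbytes/s on the 16-bit-wide link
SERIAL_BPS = 1000e6        # 1,000 Mbits/s on the fiber or coaxial link

BLOCK_BYTES = 64           # assumed block size, for illustration only

parallel_ns = transfer_time_ns(BLOCK_BYTES, PARALLEL_BPS)
serial_ns = transfer_time_ns(BLOCK_BYTES, SERIAL_BPS)

# A 16-bit-wide link carries 2 bytes per symbol, which fixes the symbol period.
symbol_period_ns = 2 / 1000e6 * 1e9

print(f"parallel link: {parallel_ns:.0f} ns per {BLOCK_BYTES}-byte block")
print(f"serial link:   {serial_ns:.0f} ns per {BLOCK_BYTES}-byte block")
print(f"parallel symbol period: {symbol_period_ns:.0f} ns")
```

Under these assumptions the serial link is eight times slower for the same block, which is the price paid for reaching longer distances over a single fiber or cable.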

Solving the bus signaling and bottleneck problems. 

SCI needed two fundamental changes from the way a bus 
transmits information. First, to make signaling speed inde¬ 
pendent of the size of the system, interfaces don’t wait for 
each signal to propagate (that is, a bus cycle) before sending 
the next signal. Each communication takes place by sending 
packets that include an address, command, and data as 
needed. While the propagation velocity of a packet is still 
limited by the speed of light, its rate of information transfer is 
not. As technology advances, the transfer rate can increase 
indefinitely. Fortunately, computer scientists know techniques 
that can compensate for the bad effects of delay (latency), 
but little can be done to compensate for a transfer rate (band¬ 
width) that is too small. (See Transaction phases box.) 
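
A rough model makes the distinction between latency and bandwidth concrete. In the sketch below, the propagation speed (0.2 m/ns, about two thirds of the speed of light), the 2-byte transfer size, and the 2-ns symbol period are assumptions chosen for illustration; the function names are hypothetical.

```python
SIGNAL_SPEED_M_PER_NS = 0.2  # assumed propagation speed in the medium


def handshaked_rate(length_m, bytes_per_transfer):
    """Bus style: each transfer waits one round trip before the next starts."""
    round_trip_ns = 2 * length_m / SIGNAL_SPEED_M_PER_NS
    return bytes_per_transfer / round_trip_ns * 1000.0  # Mbytes/s


def streamed_rate(symbol_ns, bytes_per_symbol):
    """Link style: symbols are launched back to back, whatever the length."""
    return bytes_per_symbol / symbol_ns * 1000.0  # Mbytes/s


def one_way_latency_ns(length_m):
    """Delay before the first symbol arrives; this still grows with length."""
    return length_m / SIGNAL_SPEED_M_PER_NS


for length in (1, 10):
    print(
        f"{length:2d} m: handshaked {handshaked_rate(length, 2):6.1f} Mbytes/s, "
        f"streamed {streamed_rate(2, 2):6.1f} Mbytes/s, "
        f"latency {one_way_latency_ns(length):4.1f} ns"
    )
```

In this model, stretching the connection from 1 m to 10 m cuts the handshaked rate by a factor of ten, while the streamed rate stays where it was and only the latency grows.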


Secondly, SCI uses multiple signal paths (links) so that 
multiple independent transfers can take place concurrently. 
For high performance, designers can use separate links for 
each processor, memory, or I/O device. 

We further refined the SCI link design by applying lessons 
learned in practical multiprocessor systems. The link signals 
are differential because differential signals produce the least 
system ground noise and are least sensitive to noise from 
other sources. The links are unidirectional because bidirec¬ 
tional links create noise when the drivers are turned off or on 
to reverse the direction of the link, and because turn-around 
delays increase with cable length, a scalability problem. The 
links are fast and narrow because we expect pins will always 
be relatively expensive. 

SCI does not use reverse-direction flow control signals 
because such mechanisms make the amount of buffer stor¬ 
age needed in the interfaces dependent on cable length. To 
reach high speeds, we use low-voltage differential signals. 
SCI initially uses 16-bit-wide, ECL-compatible signals because 
of the industry experience with and support for that stan¬ 
dard. Future links will probably use even lower voltages that 
are chosen for compatibility with VLSI CMOS or GaAs circuitry. 

Thus each SCI interface (node) has (at least) two links, one 
incoming and one outgoing. The links run continuously, send¬ 
ing idle symbols when no packets are being transmitted, so 
that the receiver can remain perfectly synchro¬ 
nized at all times, ready for action. SCI pack¬ 
ets do not need the prologue that is essential 
for Ethernet or similar networks. 

In a high-performance SCI system, a ven¬ 
dor-dependent switch accepts packets from 
nodes and routes them to the appropriate other 
nodes as specified by the address in the packet. 
SCI does not specify the details of such a 
switch, because many cost/performance trade¬ 
offs exist. 

Lowest cost SCI systems, such as desktop 
systems, typically will use a ring connection 
instead of a switch, connecting one node’s 
output link to its neighbor’s input. The deci¬ 
sion to support rings made the interface cir¬ 
cuit more complex because a node may 
receive a packet intended for some other node 
and have to pass it along. That process re¬ 
quires some buffer memory and some address 
recognition logic. However, the result is that 
one interface definition works over a very wide 
range of applications, increasing the produc¬ 
tion volume and lowering the costs for every¬ 
one. (See Node interface structure.) 
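
The pass-along behavior on a ring can be sketched in a few lines. This is a minimal illustration only, not the SCI interface logic: the `Packet` and `RingNode` names, the integer node addresses, and the hop-by-hop loop are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    target: int   # destination node address (hypothetical integer ID)
    payload: str


@dataclass
class RingNode:
    node_id: int
    received: list = field(default_factory=list)

    def accept(self, packet):
        """Address recognition: keep the packet if it is ours, else decline."""
        if packet.target == self.node_id:
            self.received.append(packet)
            return True
        return False


def send_on_ring(nodes, source_index, packet):
    """Follow the output links around the ring; return the number of hops."""
    count = len(nodes)
    for hop in range(1, count + 1):
        node = nodes[(source_index + hop) % count]
        if node.accept(packet):
            return hop
        # Not for this node: it passes the packet along to its neighbor.
    raise ValueError("no node on the ring recognized the address")


ring = [RingNode(i) for i in range(4)]
hops = send_on_ring(ring, source_index=1, packet=Packet(target=0, payload="read"))
print(hops)
```

In this sketch every intermediate node needs only the address comparison and enough storage to hold the packet while forwarding it, which mirrors the buffer memory and address recognition logic mentioned above.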

Point-to-point signaling is much easier than 
bus signaling. This, in combination with SCI’s 
low-current, single-voltage (48V) power dis- 


Transaction phases 

As seen in Figure B, SCI transactions handshake on a packet basis 
rather than on a bus cycle basis. A request packet contains all the 
necessary address, command, and possibly data needed to initiate a 
transaction. The packet is either accepted and stored in the responder's 
queues or (if there is insufficient space) is discarded. An echo tells 
the requester whether it can discard its send packet or must retransmit 
later because it was not accepted. A similar handshake occurs on 
the response. 
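
A rough sketch of this packet-level handshake follows. The `Responder` class, the "accepted" and "busy" echo values, and the retry loop are hypothetical names chosen for illustration; the sketch assumes the responder frees one queue entry between attempts.

```python
from collections import deque


class Responder:
    def __init__(self, queue_slots):
        self.queue = deque()
        self.queue_slots = queue_slots

    def receive(self, packet):
        """Store the packet if there is room; the return value plays the echo's role."""
        if len(self.queue) < self.queue_slots:
            self.queue.append(packet)
            return "accepted"
        return "busy"  # packet discarded; the requester must retransmit later

    def service_one(self):
        """Process (remove) the oldest queued packet, if any."""
        return self.queue.popleft() if self.queue else None


def send_with_retry(responder, packet, max_attempts=5):
    """Keep the send packet until an 'accepted' echo arrives; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        if responder.receive(packet) == "accepted":
            return attempt  # the requester may now discard its copy
        responder.service_one()  # assumption: one entry drains before the retry
    raise RuntimeError("request never accepted")


responder = Responder(queue_slots=1)
first = send_with_retry(responder, "read A")
second = send_with_retry(responder, "read B")
print(first, second)
```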
Figure B. Transactions have request and response subactions. 

Node interface structure 


Data arriving on the incoming link shown in 
Figure C have to be resynchronized to the node’s 
own clock. The receiver’s elastic buffer circuitry 
handles this step. The rest of the node circuitry 
is synchronous, which greatly simplifies the de¬ 
sign of the ultra fast FIFOs and logic. 

If a node receives a packet intended for some 
other node while it is transmitting a packet of its 
own, part or all of the incoming packet moves 
to the bypass FIFO. 

When no packet is being received, idle sym¬ 
bols keep the receiver synchronized and carry 
information about the priorities of other nodes. 
They also carry Go bits, which act like tokens 
and help ensure fair use of the links. Fair use of 
part of the bandwidth is important in avoiding 
starvation or deadlock. 

An SCI node maintains dual queues to keep 
responses independent of requests. Without dual 
queues, excessive requests could prevent the 
sending of responses, resulting in deadlock. 



Figure C. Block diagram of a typical SCI node. 


tribution system, should make ring-connected backplanes very 
inexpensive. A few printed-circuit-board layers will generally 
be enough. 

Intermediate-level SCI systems may use a combination of 
switch and ring, using paired SCI interfaces as a simple bridge 
that connects two rings. By combining many rings, designers 
can build a distributed switch fabric. Active research is in 
progress to discover optimal configurations. 2,3 Scott and 
Goodman 2 show that when pipelining is permitted, as is the 
case for SCI, we gain performance rapidly relative to syn¬ 
chronous systems by increasing the dimensionality of the 
interconnect. The resulting performance is much higher than 
for packet-synchronous systems. 

SCI defines efficient packet-based protocols that provide 
the kinds of services we expect from a computer bus. The 
main differences seen by the user relate to the separation of 
the request for service from the response. A simple bus waits 
during a memory read access time, until it gets the data. No 
other parties can use the bus during that wait. This approach 
makes it conceptually easy to handle error conditions or to 
perform complex mutual-exclusion operations (like read- 
modify-write). See the Transaction formats box. 

More sophisticated buses split the operation into a request 
and a response phase, just as SCI does. Then the interface 
must keep track of pending requests to match the responses 
to them, and a different style of mutual exclusion becomes 
necessary. Engineers who have already used split-response 


Transaction formats 

SCI packets have a 16-byte header that contains ad¬ 
dress, command, transaction identifier, and (in a re¬ 
sponse) status information. All packets that may need 
storage space in queues are multiples of 16 bytes, to 
simplify storage management at very high speeds. The 
echo packet is an 8-byte subset of the header (not shown 
in Figure D); it is never stored in queues. A few transac¬ 
tions need an extended header (not shown either), which 
adds another 16 bytes. 
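
The size rules above reduce to simple arithmetic. The following sketch uses the header, echo, and data block lengths given in this box and in Figure D; the function and constant names are our own, for illustration.

```python
HEADER_BYTES = 16
EXTENDED_HEADER_BYTES = 16       # added only for the few transactions that need it
ECHO_BYTES = 8                   # echoes are never stored in queues
DATA_BLOCK_SIZES = (0, 16, 64, 256)


def packet_bytes(data_bytes, extended_header=False):
    """Total length of a queueable packet: header plus one allowed data block."""
    if data_bytes not in DATA_BLOCK_SIZES:
        raise ValueError(f"unsupported data block length: {data_bytes}")
    total = HEADER_BYTES + data_bytes
    if extended_header:
        total += EXTENDED_HEADER_BYTES
    # Every packet that may need queue storage is a multiple of 16 bytes.
    assert total % 16 == 0
    return total


sizes = [packet_bytes(n) for n in DATA_BLOCK_SIZES]
print(sizes)
```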


Transaction   Request                        Response
readxx*       Header                         Header + 0, 16, 64, or 256
writexx*      Header + 16, 64, or 256        Header
movexx*       Header + 0, 16, 64, or 256     (none shown)
locksb        Header + 16                    Header + 16

* xx represents one of the allowed data block lengths 
(number of data bytes following the header). 



Figure D. SCI packets. 
buses will find SCI to be clean and simple; those who haven’t 
may face a learning curve. 

Solving the cache coherence problem. Cache memo¬ 
ries are important for keeping processors running at full speed, 
but they introduce some system management complexity. 
The various caches often contain duplicate copies of data 
that must be kept consistent with one another, the cache 
coherence problem. Say that two processors read the same 
data (perhaps a synchronization flag) from memory and cache 
it (perhaps on chip), and then one changes the data (per¬ 
haps to free a resource that the other is waiting for). Some¬ 
how the other processor must discover that its cached copy 
is invalid so that it can be updated with current information. 

Designers have maintained cache consistency, or coher¬ 
ence, in small bused systems by taking advantage of the bus 
bottleneck—every cache controller observes every transac¬ 
tion in the system, “snooping” to catch transactions that might 
invalidate cached data. That approach can be scaled up to a 
few buses by making the bridges rather sophisticated, but it 
does not work in highly parallel systems. 

A cache coherence scheme that scales to large systems 
requires a directory that keeps track of which data are being 


used by which caches, so that the appropriate caches can be 
updated as necessary. Earlier schemes used a directory but 
kept it in memory. Instead, SCI maintains it as a distributed, 
doubly linked list of caches, with the head pointer at a memory 
controller and the link pointers stored in the cache control¬ 
lers. With this approach we know that the correct amount of 
storage is always available for the directory structure, no matter 
how many caches are sharing copies of a particular line of 
data. This approach also spreads the maintenance traffic across 
the system, rather than concentrating it at the memory. Even 
though these linked-list structures are shared, we designed 
them to be updated concurrently by multiple processors with¬ 
out any semaphores or lock variables. The protocols use in¬ 
divisible compare-and-swap transactions instead. (See the 
Distributed cache tags box.) 

We can avoid the cache coherence problem if memory is 
not shared. Most of the early multiprocessors, which rely on 
message passing and explicit interprocessor communication, 
use this approach. However, it is often difficult to move com¬ 
puter programs from single-processor machines to message¬ 
passing multiprocessors. Note that shared-memory machines 
can easily pass messages, so they are more versatile. The 


Distributed cache tags 

SCI is a distributed system with many links carrying 
data independently and concurrently. Only directory- 
based cache coherence is practical in large high-perfor¬ 
mance systems of this kind. The directory keeps track of 
which caches are sharing each cache line of data, so that 
the right caches can be notified when their copy be¬ 
comes invalid. Rather than centralizing the coherence 
directory (in RAM), SCI distributes it among those cache 
controllers that are sharing the data. 


Processors 

Head Mid Mid Tail 



E unit 
Cache 


Figure E. A typical linked list for one cache line. 


The SCI coherence directory always has storage avail¬ 
able in exactly the amount needed, because the cache con¬ 
trollers sharing the given cache line provide it. The arrows 
in Figure E represent bidirectional pointers forming a dou¬ 
bly linked list. These pointers and a few status bits consti¬ 
tute the directory entry for this cache line. 

The doubly linked list structure makes it possible for 
any cache to roll out a cache line to make room for new 
data, removing itself from the list by telling its neighbors to 
point to each other. This list maintenance traffic is distrib¬ 
uted throughout the system, greatly reducing the concen¬ 
tration of traffic that is typical of centralized directory 
schemes. 
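
As a rough illustration of the roll-out step, the sketch below models one cache line's sharing list as an ordinary doubly linked list. The class and method names are hypothetical, and the code runs sequentially; the real protocol carries out these pointer updates with transactions between concurrently operating cache controllers.

```python
class CacheEntry:
    """Directory state one cache controller keeps for one cache line."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.forward = None    # neighbor toward the tail
        self.backward = None   # neighbor toward the head


class SharingList:
    def __init__(self):
        self.head = None       # head pointer kept at the memory controller

    def attach(self, entry):
        """A new sharer becomes the head of the list."""
        entry.forward = self.head
        entry.backward = None
        if self.head is not None:
            self.head.backward = entry
        self.head = entry

    def roll_out(self, entry):
        """Leave the list by telling the neighbors to point to each other."""
        if entry.backward is not None:
            entry.backward.forward = entry.forward
        else:
            self.head = entry.forward
        if entry.forward is not None:
            entry.forward.backward = entry.backward
        entry.forward = entry.backward = None

    def sharers(self):
        ids, entry = [], self.head
        while entry is not None:
            ids.append(entry.node_id)
            entry = entry.forward
        return ids


line = SharingList()
entries = {n: CacheEntry(n) for n in (1, 2, 3, 4)}
for n in (1, 2, 3, 4):
    line.attach(entries[n])
line.roll_out(entries[3])
print(line.sharers())
```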

The protocols support optional optimizations for impor¬ 
tant cases like pairwise sharing and serial resource alloca¬ 
tion. Pairwise sharing lets two sharers pass data back and 
forth as needed without involving memory. Serial resource 
allocation (queue on lock bit) uses the list structure to pass 
the resource along without needless communication in the 
interconnect. 

Note that the directory entry is a shared data structure 
that may be concurrently accessed by multiple processors. 
The SCI protocols rely on atomic compare-and-swap op¬ 
erations to ensure correctness without using semaphores 
or lock variables, which would be less efficient and intro¬ 
duce management complexity. For example, what do you 
do when a processor sets a lock variable and then dies? 
Cache coherence 


Until recently, microprocessor designers considered the 
use of a cache to be their own decision, based on cost/ 
performance objectives, with little concern for the needs 
of multiprocessors. With just one processor, the existence 
of a cache usually affects its performance but not the 
correctness of its execution. A common exception occurs in 
the case of self-modifying code, where the processor ex¬ 
ecutes instructions from the cache that are writing to 
memory in a futile effort to modify themselves. 

However, in a system with two or more processors each 
of which has a cache, serious problems can result if the 
caches are not properly designed to work in harmony. For 
example, suppose two processors use a variable to keep 
track of which one is using the printer, and the software 
waits until no one is using it before starting a new print 
job. If two primitive processors read the same variable and 
each keeps it in its own cache, they each read only their 
cached copy, not any changes that the other processor 
may make to the variable. Designers must either arrange 
that such variables are never stored in caches or design the 
cache control so that each cache discovers when its copy 
of the variable is no longer correct, discards the old (in¬ 
valid) value, and obtains a fresh copy. 

Such mechanisms add cost and complexity, so it is not 
surprising that they were omitted in early caching 
(uni)processors. We call keeping all the cached copies of a 
data item consistent the cache coherence problem. 

The cache controller can determine that its copy is in¬ 
valid either by tracking every transaction in the system that 
might be capable of changing the data or by being notified 


by a reliable source. Bused multiprocessors usually use 
snooping to maintain coherence. Each cache monitors the 
bus at all times, which keeps it fairly busy since it also has 
to be handling the memory requests from its processor. 
When the cache controller identifies another processor writ¬ 
ing to the address of data it has cached, it can either pick 
up the new value on the fly (“snarfing”) or mark the old 
one invalid and go to memory for a good copy when (and 
if) the processor needs it later. Say a cache can verify that 
it has the only good copy—because it changed the data— 
and it sees another processor trying to read the data from 
memory. It must then intervene to prevent the other cache 
from accepting a stale copy. Either the cache can supply 
the data itself or it can abort the transfer, update memory 
with the latest copy, and then let the transfer be retried. 

The snooping mechanism relies on every controller be¬ 
ing able to track every transaction. That’s acceptable on a 
bus, because only one transaction can happen at a time, 
and each controller tracks it. With effort and care, snoop¬ 
ing can even work in a system with several buses. But in a 
highly parallel system, like SCI, the cost of snooping is 
intolerable; too many transfers happen at the same time 
and in different places. 

The usual solution to this problem is to maintain a direc¬ 
tory that tracks which caches have copies of each cache 
line. (A cache line is the amount of data tracked as a single 
entity by the cache controller. SCI tracks 64-byte lines; ear¬ 
lier systems usually used smaller line sizes.) 

The next decision is where to keep the directory and 



goal, then, is to make the coherence protocols efficient and 
economical. 

See the Cache coherence box for further amplification of 
the problem. 

Multiprocessor issues. SCI designers placed a high pri¬ 
ority on eliminating several architectural problems, such as 
deadlocks and livelocks, that have plagued past multiproces¬ 
sors. Deadlocks can occur when two processors request the 
same two resources in the opposite order. Each processor 
may access one resource (blocking the other processor) then 
become blocked by failure to access the other resource. 
Livelock, or starvation, occurs when one processor repeat¬ 
edly accesses a shared resource without letting another have 
its turn for an arbitrarily long time. Deadlock and livelock are 
examples of what we generically refer to as “forward progress” 
issues. To guarantee forward progress, each operation should 
result in some useful work done, so that the system will not 


consume all its resources performing useless retries. 
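
The two-resource deadlock described above can be made concrete by looking for a cycle in a "who waits for whom" relation. The sketch below is our own illustration, not part of SCI; the `holds` and `wants` dictionaries and the processor and resource names are hypothetical, and each processor holds and wants at most one resource.

```python
def find_deadlock(holds, wants):
    """Return the processors forming a wait cycle, or None if there is none.

    holds: processor -> resource it currently owns
    wants: processor -> resource it is blocked waiting for
    """
    owner = {resource: proc for proc, resource in holds.items()}
    for start in wants:
        seen = []
        proc = start
        # Follow the chain: proc waits for a resource owned by the next proc.
        while proc in wants and wants[proc] in owner:
            if proc in seen:
                return seen[seen.index(proc):]
            seen.append(proc)
            proc = owner[wants[proc]]
    return None


# Two processors take the same two resources in the opposite order.
cycle = find_deadlock(holds={"P0": "A", "P1": "B"}, wants={"P0": "B", "P1": "A"})
print(cycle)

# One processor simply waits for a resource whose owner is not blocked.
no_cycle = find_deadlock(holds={"P0": "A"}, wants={"P1": "A"})
print(no_cycle)
```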

Experience with multiprocessor systems shows that they 
tend to synchronize themselves around problems like these. 
Even though the designer has estimated the probability of 
livelock to be extremely small, its effect on the system once it 
occurs is such that it tends to cause frequent repetitions. 

Therefore, we designed the SCI protocol to eliminate de¬ 
pendencies that can cause deadlocks and to include a simple 
mechanism that assures fair allocation of resources to avoid 
livelocks. In a few cases, deadlock avoidance seems to make 
the protocol less efficient (until we consider the indirect con¬ 
sequences!). But in most cases choosing a clean protocol 
resulted in no penalty. 

Of course, software that is outside SCI’s scope can still 
create deadlocks. The SCI designers could only eliminate 
deadlocks from the underlying protocols. 

SCI includes efficient mechanisms for supporting shared 
resources, shared data structures, and mutual exclusion. Since 
the old standby read-modify-write is impractical in a parallel¬ 
processing environment, SCI designers took a fresh look at 
the problem and incorporated several mechanisms that will 
be helpful. 

In a cache-coherent environment, the mechanism that guar¬ 
antees exclusive use of a cache line by a writer can handle 
mutual exclusion. The processor can temporarily lock an ex¬ 
clusive cache line while it performs any operations it desires 
then release it to other readers or writers. However, coher¬ 
ence may not be available in all cases. For example, when 
accessing (through a bridge) a bus that does not support 
cache coherence, one may need to tell the bridge to perform 
a read-modify-write. 

Therefore, SCI defines a set of lock primitives that experi¬ 
ence shows to be useful for a variety of purposes: masked 


swap, compare-and-swap, and fetch-and-add. Each of these 
primitives sends the command and necessary data to a desti¬ 
nation device (perhaps a bridge or a memory controller). 
That device performs the operation indivisibly, returning the 
result to the requester. This mechanism works well through 
switches or other interconnects. (It even works well on buses.) 
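
The following sketch models a destination device that applies each lock primitive indivisibly and returns the previous value. The semantics shown are the conventional ones for operations with these names and are an assumption here, as are the `LockDevice` class and its word-addressed storage; a Python lock stands in for the hardware's indivisibility.

```python
import threading


class LockDevice:
    """Hypothetical destination device (a bridge or memory controller)."""

    def __init__(self, size):
        self.words = [0] * size
        self._guard = threading.Lock()  # stands in for hardware indivisibility

    def masked_swap(self, addr, new, mask):
        """Replace only the bits selected by mask; return the old word."""
        with self._guard:
            old = self.words[addr]
            self.words[addr] = (old & ~mask) | (new & mask)
            return old

    def compare_and_swap(self, addr, expected, new):
        """Store new only if the word still equals expected; return the old word."""
        with self._guard:
            old = self.words[addr]
            if old == expected:
                self.words[addr] = new
            return old

    def fetch_and_add(self, addr, increment):
        """Add increment to the word; return the value it had before."""
        with self._guard:
            old = self.words[addr]
            self.words[addr] = old + increment
            return old


dev = LockDevice(size=4)
dev.words[0] = 0xABCD
old_word = dev.masked_swap(0, new=0x1234, mask=0x00FF)
first_try = dev.compare_and_swap(1, expected=0, new=7)   # succeeds
second_try = dev.compare_and_swap(1, expected=0, new=9)  # fails, word is 7
ticket = dev.fetch_and_add(2, 5)
print(hex(dev.words[0]), dev.words[1], dev.words[2])
```

Because the requester receives the old value in the response, it can tell whether a compare-and-swap took effect without any further transaction.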

A particularly interesting use of the swap operations is in 
the maintenance of shared lists that hold information for an 
interrupt-servicing processor or commands for a DMA 
controller. If these lists are properly structured, multiple 
processors can add items to them while one processor takes items 
off, without any need for lock variables. 

A clean way to handle interrupts is to associate a particular 
shared list with one priority of interrupt service and a specific 
bit in an interrupt-triggering control register. To request in¬ 
terrupt service, an I/O device or processor adds a work item 


Cache coherence (continued) 

how to organize it. One way is to keep somewhere in 
memory a bit map for each cache line, setting a bit corre¬ 
sponding to a particular cache when that cache comes to 
the memory for the data. But in a large system, this would 
require too many bits. Using thousands of bits to account 
for the location of one cache line isn't acceptable. The 
next refinement might be to make a list in memory, keep¬ 
ing track of the identifiers of the caches that have taken 
copies. But while most cache lines are shared by perhaps 
only one or two caches, some might be shared by every 
cache in the system. Thus the storage for the worst case list 
becomes impossibly large. Designers have used a variety 
of clever compromises, such as allocating a small amount 
of list storage and then assuming the worst when it runs 
out. In that case, they assume that every cache in the sys¬ 
tem has a copy and must be notified when the data changes. 

SCI features a very general and scalable directory struc¬ 
ture. Memory controllers keep (and store in special memory) 
one node pointer and a few state bits for each cache line 
they have. That pointer points to the head of a list. The 
address of the data in a read transaction routes the read to 
the memory and tells the memory which cache line is de¬ 
sired. The read transaction includes the node identifier of 
the requester. The memory controller exchanges that iden¬ 
tifier with its pointer for that cache line, returning the old 
pointer value to the read requester. If the memory has a 
current copy of the data, it returns the data to the requester 
too. Otherwise the memory informs the requester that the 
data is somewhere else, in another cache, and it can use 
the returned pointer to find it. 
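
The memory controller's part in a read can be sketched as a single pointer exchange. The class names, the `current` flag, and the tuple returned to the requester are illustrative assumptions, not the SCI packet format.

```python
class MemoryLine:
    def __init__(self, data):
        self.data = data
        self.head = None      # node pointer: head of this line's sharing list
        self.current = True   # state bit: memory holds a current copy


class MemoryController:
    def __init__(self):
        self.lines = {}       # cache line address -> MemoryLine

    def read(self, address, requester_id):
        """Exchange the requester's ID with the head pointer for this line.

        Returns (old_head, data). data is None when the current copy lives
        in another cache, which the requester finds through old_head.
        """
        line = self.lines[address]
        old_head, line.head = line.head, requester_id
        data = line.data if line.current else None
        return old_head, data


memory = MemoryController()
memory.lines[0x40] = MemoryLine(data="payload")
first = memory.read(0x40, requester_id=3)    # list was empty
second = memory.read(0x40, requester_id=7)   # node 3 was the previous head
memory.lines[0x40].current = False           # some cache now holds newer data
third = memory.read(0x40, requester_id=9)
print(first, second, third)
```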

SCI cache controllers keep in their tag memory storage, 
for each line, the memory address of the line (like all cache 


controllers have to do), a few state bits, and two pointers. 
The pointers form a doubly linked list of those nodes that 
have the particular line in their caches. 

There are two particularly good properties of this scheme. 
First, it scales properly. No matter how many caches or 
how much memory or what the sharing behavior happens 
to be, exactly the right amount of storage is always avail¬ 
able, namely two pointers per cached copy plus one pointer 
per cache line in memory. 
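
A back-of-the-envelope comparison shows why this matters. The sketch below counts directory bits for a bit-map scheme and for the linked-list scheme; the 16-bit pointer width and the example system size are assumptions chosen for illustration, and state bits are ignored in both cases.

```python
def bitmap_bits(memory_lines, caches):
    """Bit-map directory: one bit per cache for every line of memory."""
    return memory_lines * caches


def linked_list_bits(memory_lines, cached_copies, pointer_bits=16):
    """Linked-list directory: one pointer per memory line, two per cached copy."""
    return pointer_bits * (memory_lines + 2 * cached_copies)


# Hypothetical system: one million memory lines, 4,096 caches,
# and two million cached copies in total.
bitmap = bitmap_bits(1_000_000, 4096)
linked = linked_list_bits(1_000_000, 2_000_000)
print(bitmap, linked)
```

Note also that the bit map sits entirely at the memory, whereas two-thirds or more of the linked-list pointers in this example live in the caches themselves.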

Second, maintaining this list is a distributed process. Only 
one transaction touches the memory for each request. All 
the rest (following pointers, adding itself to the list, remov¬ 
ing itself when its cache overflows and it needs to roll out 
a cache line to make more room) involve transactions 
among the distributed caches, which do not contribute to 
traffic congestion at the memory. 

Though the bus-snooping mechanism may seem con¬ 
ceptually simpler, keep in mind that it bears a high price. It 
prevents us from having thousands of transactions pro¬ 
ceeding at once. Furthermore, every cache in a snooping 
system must participate in every address cycle. Thus, ev¬ 
ery cache must be very fast so that participation does not 
slow the system too much, because the slowest one sets 
the speed for all. 

In the SCI scheme the cache controller acknowledges a 
packet’s arrival and checks it when it is convenient. If the 
controller is slow, it only slows the transactions in which it 
participates, not all transactions. As designers add new, 
higher-performance caches to a system, they get a propor¬ 
tionate performance improvement. This is another desir¬ 
able kind of scaling behavior; designers don’t have to discard 
old cache controllers to get the benefit of new ones. 
Interrupts 


Interrupts are simple in a single-processor system, be¬ 
cause it is clear which processor should perform the re¬ 
quested service. 

Some simple systems just use one signal line to trigger 
an interrupt. The interrupt causes the processor to save its 
state on the stack or in a duplicate set of registers, then 
execute software that asks all possible interrupt sources 
whether they need service—interrupt-driven polling. 

More sophisticated systems allow the interrupting de¬ 
vice to identify itself. Some kind of arbitration determines 
which source has the right to put its identifying vector on 
the bus during the interrupt-acknowledge cycle. The vec¬ 
tor generally changes into an address at which the corre¬ 
sponding service code starts in memory. The arbitration 
mechanism varies from system to system. It may use a 
central method, so that individual signal lines connect to 
each interrupt source, or a distributed method, using a 
daisy chain or some other arbitration mechanism. Often a 
combination of polling and vectoring is used. For example, 
the service routine might poll a list of devices known to be 
connected to interrupt line 3. 

These mechanisms are adequate for finding the source of 
an interrupt but do not deal at all with the problem of 
determining which of several processors should service it. 

Systems with multiple processors require a more gen¬ 
eral mechanism. The most common solution provides a 
special control register associated with the interrupt sys¬ 
tem of each processor. To generate an interrupt, a device 
must become the bus master and write to the interrupt 
register located at the address corresponding to the pro¬ 
cessor to be interrupted. Becoming a bus master and ex¬ 
ecuting a write may require additional hardware in the 
interrupting device. So, some multiprocessor systems 
(Fastbus, IEEE Std 960) also provide a hybrid capability 
that allows a simple device to assert a signal that causes a 
shared intermediary device to (possibly poll and then) 
perform the write for it. 

It is tempting to design the interrupt registers to accept 
vector information directly. A device might write its as¬ 


signed identifier into the register for use much like the 
vector used in simple interrupt systems. This temptation 
should be avoided, because there are hidden perils. For 
example, what happens if a second interrupt-write arrives 
soon after the first? Either it must be stored, implying a 
FIFO, or it has to be aborted and retried later. Any FIFO 
has a finite capacity, however, so under adverse condi¬ 
tions it might not be capable of holding any more vectors. 
But aborting the write and retrying also causes trouble, 
because it can lead to deadlocks. For example, suppose 
the interrupt service routine of one processor must send 
an interrupt to another, and the other processor has a ser¬ 
vice routine that has to interrupt the first. When both pro¬ 
cessors' FIFOs fill for some reason, deadlock ensues because 
neither can do anything except continue to retry sending 
the interrupt. 

We can argue that these conditions are anomalous and 
designing the hardware to have large enough FIFOs will 
make the problem unlikely to occur. However, experience 
shows that real multiprocessor systems seem to seek out 
these trouble spots. 

A clean solution puts the burden of allocating resources 
on the interrupt requester, so that it cannot ask for an 
interrupt unless it has the resources to ensure the interrupt 
request can be delivered. Suppose the requester allocates 
a block of memory to hold its vector and possibly other 
service information, links the block into a list of service 
requests, and then writes a pulse to the interrupt service 
register. Now, no possibility of being blocked exists, and 
no deadlock can occur. 

We can design the service-request list so that any number 
of interrupt requesters can concurrently add their blocks 
to the list while one server removes blocks to service them. 
We will not need semaphores or lock variables if the system 
supports indivisible swap and compare-and-swap transactions. 
These transaction types are extremely valuable in 
multiprocessor systems and deserve wide support by new 
processor architectures. 


to the list describing what needs to be done, then sets the 
appropriate bit in the interrupt register. The item in the list is 
similar to the interrupt vector in single-processor systems. 
When the processor is ready to service that priority of inter¬ 
rupts, it takes work items off the list and services them. 

This mechanism is clean because the service requester 
allocates all storage, so there is no danger of the server 
running out of storage when the work piles up (a possible cause 
of deadlock). The interrupt bit is simple to implement be¬ 


cause it is only a latch, with no FIFO storage or critical timing 
implied. Setting the bit merely alerts the processor that the 
list should be checked; the processor can clear the bit as 
soon as it commits to look at the list. (See the Interrupts box 
for more information.) 
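
One way to picture this mechanism is the sketch below, in which several requesters add work items with compare-and-swap while a single server detaches the whole list with one swap. It is an illustration under stated assumptions, not the SCI-defined list format: `AtomicCell` models a memory word with indivisible swap and compare-and-swap (a Python lock stands in for the hardware), and all other names are hypothetical.

```python
import threading


class AtomicCell:
    """Models a word that supports indivisible swap and compare-and-swap."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def swap(self, new):
        with self._guard:
            old, self._value = self._value, new
            return old

    def compare_and_swap(self, expected, new):
        with self._guard:
            old = self._value
            if old is expected:
                self._value = new
            return old


class WorkItem:
    def __init__(self, description):
        self.description = description
        self.next = None


list_head = AtomicCell(None)
interrupt_bit = AtomicCell(False)   # a simple latch, no FIFO behind it


def request_service(description):
    item = WorkItem(description)    # the requester allocates the storage
    while True:
        item.next = list_head.load()
        if list_head.compare_and_swap(item.next, item) is item.next:
            break                   # linked in; otherwise another requester won, retry
    interrupt_bit.swap(True)        # alert the processor to check the list


def service_interrupts():
    interrupt_bit.swap(False)       # clear as soon as we commit to look at the list
    item = list_head.swap(None)     # detach the entire list indivisibly
    done = []
    while item is not None:
        done.append(item.description)
        item = item.next
    return done[::-1]               # oldest request first


def device(k):
    for i in range(50):
        request_service(f"device {k} item {i}")


threads = [threading.Thread(target=device, args=(k,)) for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
serviced = service_interrupts()
print(len(serviced))
```

Clearing the bit before detaching the list means that an item added afterward sets the bit again, so in this sketch no request can be left on the list unnoticed.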

A common use of mutual exclusion is to grant sequential 
use of one resource to a series of requesters. This process can 
generate a large amount of useless interconnect traffic as 
processors keep updating their cached copies of the shared 
exclusion variable. SCI defines a queue-on-lock-bit, or QOLB, 
mechanism that uses the linked lists of the cache coherence 
system to pass the resource efficiently from one processor to 
the next. 

Statistics on memory sharing are controversial, because few 
relevant machines exist and because the usage patterns on 
any real machine will evolve to optimize performance on that 
architecture. However, most researchers agree that unshared 
data is the most common case, followed by pairwise shared, 
followed by multiply shared. Thus the pairwise sharing case 
may be an important one to optimize. SCI defines optional 
pairwise sharing optimizations that allow two processors to 
pass data back and forth without interacting with memory, 
thus distributing system traffic and reducing activity at the 
memory controllers. 

We made significant changes in the initialization model and 
in the executable C code that embodies the detailed specifica¬ 
tions. In particular, we changed compile-time options to 
runtime options so that the same code could be used for test¬ 
ing the interaction of nodes implementing different option sets. 
Also, we significantly improved bit-serial link specifications 
using material supplied by the Serial HIPPI working group. 

The revision process converted all but one of the seven 
negative votes to affirmative by responding to the concerns 
expressed. As mentioned earlier, the one remaining negative 
vote can only be changed by converting the C code to Pascal, 
which would be unacceptable to the working group. 

We redistributed the resulting Draft 2.0 to the balloting body 
in December 1991 and expect that the standard will receive 
final IEEE approval early in 1992. Supporting chips should be 
available within a few months. Dolphin SCI Technology of 
Oslo, Norway, will offer single-chip SCI interfaces that 
incorporate transceivers, FIFOs, and cache coherence support. 
Several versions are planned, for use as a processor, a memory, or 
an I/O interface. Hewlett Packard is preparing parallel/serial 
converter chips to interface the Dolphin chips (parallel) to the 
1,000-Mbps serial encoding used on optical fiber or coaxial 
cable. These chips should be available early in 1992. 

Related standards projects 

The following projects are either used by SCI or form a 
part of ongoing work related to future SCI developments. 

IEEE Std 1212. The Control and Status Register Architecture 
standard defines the I/O architecture for SCI, Futurebus+ 
(IEEE Std 896.1 and 896.2-1991), and Serial Bus (P1394). 
David V. James, Apple Computer, 20525 Mariani Ave., 
Cupertino, CA 95014, phone 408-974-1321, fax 408-974-0781, 
dvj@apple.com, chaired the group. Voters approved the draft 
standard, which was then modified in response to ballot com¬ 
ments. The balloting body received a second and final 
recirculation, and the IEEE Standards Board granted final ap¬ 
proval in December 1991. 

IEEE Std 1301. The Metric Equipment Practice for 
Microcomputers—Coordination Document is an approved 
standard (June 1991). This specification defines the generic met¬ 
ric modular packaging family used by the SCI module. Hans 
Karlsson, Ericsson Telecom AB, TN/ETX/T/F, Stockholm, S- 
126 25 Sweden, phone +46-8-719-6037, fax +46-8 719 8282, 
chaired the group. 

IEEE Std 1301.1. The Detailed Standard for a Metric Equipment 
Practice for Microcomputers Using 2-mm Connectors and 
Convection Cooling is also approved (June 1991). This specifi¬ 
cation details the specific subset of IEEE Std 1301 that is used 
by the SCI module. Hans Karlsson served as chair. EIA IS-64, 
February 1991, presently defines the 2-mm connector, but it 
will become an IEC standard soon. (This connector family is 
sometimes called Metral, which is DuPont’s trademark.) 

P1394. The Serial Bus working group is developing a high-speed 
(10-20 Mbytes/s) serial bus that can be used for low-cost 
diagnostics (supporting IEEE Std 1149.1 Boundary-Scan 
Architecture in large systems) and I/O. The group aims at a 
very low cost ($15 per connection, including cable, connector, 
and interface) bus for use in consumer products. SCI includes 
a Serial Bus connection in the module power connector. 
Michael Teener, Apple Computer, 3535 Monroe St., Santa 
Clara, CA 95051; phone 408-974-3521, fax 408-985-9893, 
teener@apple.com, chairs the group. Work on this standard 
should complete in 1992. 

P1596.1. The SCI/VME Bridge project is defining a bridge 
architecture for interfacing VMEbuses to an SCI node. This 
project provides I/O support for early SCI systems via VME. 
Products are likely to be available in 1992. Bjorn Solberg, 
CERN, CH-1211 Geneva 23, Switzerland, phone +41-22-767- 
2677, fax +41-22-782-1820, bsolberg@dsy-srv3.cern.ch, chairs 
the group. 

The project’s main decisions involve the mechanism for 
mapping addresses between VME and SCI, which versions of 
VME to support, and how to handle interrupts, mutual exclu¬ 
sion, and cache coherence. 

P1596.2. The Cache Optimizations for Large Numbers of 
SCI Processors project is developing request combining, tree- 
structured coherence directories, and fast data distribution 
mechanisms needed for systems with thousands of 
processors, compatible with the base SCI coherence mechanism. 
Ross Johnson, Computer Science Dept., 1210 Dayton Street, 
University of Wisconsin, Madison, WI 53706, phone 608-262- 
6617, fax 608-262-9777, ross@cs.wisc.edu, chairs this group. 

This working group is developing and extending ideas 
that came up during the development of the base SCI stan¬ 
dard, but which it felt could be postponed to avoid delaying 
SCI’s introduction. That is, the protocols defined by P1596 
seem adequate for systems with perhaps hundreds of pro¬ 
cessors (enough for a year or so) and include hooks for add¬ 
ing these optimizations later. 

When a large number of requests is addressed to the same 
node, the interconnect becomes congested and performance 
suffers. Certain kinds of requests, such as reads and fetch- 
and-adds, may be combined in the network to reduce this 
congestion. Request combining allows several requests to be 
combined into one when they meet (waiting in queues in 
the interconnect). All but one of these requests generate an 
immediate response, which tells the requester to get the data 
from that one’s cache instead. 

SCI nodes handle this kind of no-data response already,
because it is used in the basic coherence protocol. The remaining
request goes forward, eventually resulting in a response
that provides the data to that cache, where the other
nodes read them. This design spreads out the traffic, reducing
congestion.
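
The combining step can be sketched in a few lines of C. This is only an illustration of the idea, not the P1596.2 protocol: the request structure, the combine_requests function, and the choice of the first queued request as the one that goes forward are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one read request waiting in an interconnect queue. */
struct request {
    int requester;      /* node that issued the request */
    unsigned addr;      /* address being read */
    int redirect_to;    /* -1: request goes forward; otherwise the node
                           whose cache will eventually hold the data */
};

/* Combine queued requests that target the same address. The first request
   for each address goes forward; each later one is answered at once with a
   no-data response naming the forwarded requester. Returns the number of
   requests that still travel to the target node. */
size_t combine_requests(struct request *q, size_t n)
{
    size_t forwarded = 0;

    assert(q != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        q[i].redirect_to = -1;
        for (size_t j = 0; j < i; j++) {
            if (q[j].addr == q[i].addr && q[j].redirect_to == -1) {
                q[i].redirect_to = q[j].requester;
                break;
            }
        }
        if (q[i].redirect_to == -1)
            forwarded++;
    }
    return forwarded;
}
```

Because each redirected request carries the identity of the surviving requester, nothing about the combination has to be remembered in the queue afterward.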

Note that these immediate responses relieve the interconnect
from retaining any information about the combining,
which enormously simplifies the process compared to previous
implementations.

Once the data become available, the time needed to distribute
them to all the requesters becomes important. The
linear linked lists of the base SCI standard result in times
proportional to the number of requesters, which can be a
performance problem in large systems.

Instead of linear lists, the group would like to maintain a
tree structure, which could distribute the data in time proportional
to the logarithm of the number of requesters.
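
To give a feel for the difference, the following C sketch counts forwarding steps for a linear list and for a balanced tree. It assumes one step per list hop or per tree level and ignores all protocol detail; the function names and the fan-out parameter are hypothetical.

```c
#include <assert.h>

/* A linear list hands the data on one requester at a time. */
unsigned linear_steps(unsigned requesters)
{
    return requesters;
}

/* A balanced tree with the given fan-out reaches every requester once its
   deepest level has been served. Returns the number of levels needed to
   hold `requesters` nodes. */
unsigned tree_steps(unsigned requesters, unsigned fanout)
{
    unsigned levels = 0;
    unsigned long long capacity = 0;  /* nodes held by `levels` full levels */

    assert(fanout >= 2);
    while (capacity < requesters) {
        capacity = capacity * fanout + 1;
        levels++;
    }
    return levels;
}
```

Under these assumptions, 1,000 requesters take 1,000 steps on a linear list but only 10 levels in a binary tree.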

At first, group members thought it would be impractical to
maintain binary trees in the distributed coherence directory
because the overhead would be too high. The schemes they
had seen others use were not acceptable for SCI. These involved
setting lock variables to get mutual exclusion while
tree maintenance was done, thus scaling poorly and violating
an SCI design principle. (Lock variables also introduce a
variety of complications, such as what to do when the process
that holds the lock fails.)

Thus the group first considered using approximate or temporary
pointers, which would form short cuts along the linear
directory lists but gradually become inaccurate as
processors rolled out cache lines and so on. Whenever a
temporary pointer was used, it would be checked for validity,
and the algorithm would drop back to following the
(always valid) linear list when necessary.
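
The fallback idea can be illustrated with a small C model. It is a sketch only: the entry layout, the in_list flag used as the validity check, and the assumption that short cuts always point forward along the list are all hypothetical and are not taken from any SCI draft.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical directory entry. `next` is the always-valid linear link;
   `shortcut` is a temporary pointer that may have gone stale. */
struct entry {
    struct entry *next;
    struct entry *shortcut;
    int in_list;            /* cleared when the entry is rolled out */
};

/* Walk from `head` to the tail, taking a short cut only when its target is
   still in the list. Stores the number of hops taken in `*hops`. */
struct entry *find_tail(struct entry *head, unsigned *hops)
{
    struct entry *e = head;

    assert(head != NULL && hops != NULL);
    *hops = 0;
    while (e->next != NULL) {
        if (e->shortcut != NULL && e->shortcut->in_list)
            e = e->shortcut;    /* short cut passed the validity check */
        else
            e = e->next;        /* drop back to the linear list */
        (*hops)++;
    }
    return e;
}

/* Unlink `victim` (not the head), as when a processor rolls out a cache
   line. Short cuts pointing at it are left alone and simply become stale. */
void roll_out(struct entry *head, struct entry *victim)
{
    for (struct entry *e = head; e != NULL; e = e->next) {
        if (e->next == victim) {
            e->next = victim->next;
            break;
        }
    }
    victim->in_list = 0;
}
```

A stale short cut costs only the extra hops of the linear walk; it never produces a wrong answer, which is the property the text describes.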

But at the August 1991 meeting, Ross Johnson presented a 
method for maintaining correct trees at all times, without 
using lock variables and without adding much overhead. 
Though details need to be worked out and some corner cases 
need more study, the group feels the remaining questions 
can be resolved. 

Open issues concern the worst case scenarios (when things 
happen in the worst possible sequence) and how to reduce 
the likelihood of having these occur. 

P1596.3. The Low-Voltage Differential Signals for SCI
project specifies low-voltage differential signals suitable for
high-speed communication between CMOS, GaAs, and
BiCMOS logic arrays used to implement SCI. The object is to
enable low-cost CMOS chips to be used for SCI implementations
in workstations and personal computers, at speeds of at
least 200 Mbytes/s. Gary Murdock, National Semiconductor,
642 Pineview Drive, San Jose, CA 95117, phone 408-721-7269,
fax 408-721-7218, chairs this group.




Faster signaling requires smaller signals, if edge rates and
currents are to be kept reasonable. Smaller signals require
differential signaling (or at least their own reference independent
of system ground). At first glance, differential signaling
seems to cost a factor of two in signal traces and pins. But
the real cost is much smaller because far fewer ground pins
are needed, far less system noise is created (or picked up),
and the higher signaling speeds reduce the number of parallel
signals needed.

SPICE modeling shows that we can signal at SCI speeds (2 
ns/bit/signal pair) with contemporary CMOS technology. 
MOSIS (a fast-turnaround prototype chip fabrication service) 
test chips have already reached nearly this performance. In 
fact, the hardest part of the problem is how to provide or 
accept the data at the signaling rate! 

The group bases present modeling on a 250-mV voltage
swing, centered on 1 V. This approach provides a little headroom
for common-mode rejection at the receiver, while allowing
use of 5-V, 3.3-V, and eventually 2-V technologies.

The working group is choosing certain signal levels and 
rates to be supported as signal interchange (link) standards 
for CMOS implementations of SCI. It will also define an 8-bit 
and possibly a 4-bit link to complement the Std 1596-defined 
16-bit and 1-bit links. This work should complete in 1992. 
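
The relation between link width, bit time, and peak bandwidth is simple arithmetic, shown here as a C sketch. It assumes one bit per signal pair per bit time and ignores packet and protocol overhead; the function name is hypothetical, and integer division truncates any fractional result.

```c
#include <assert.h>

/* Peak link bandwidth in Mbytes/s for a link that is `width_bits` wide and
   moves one bit per signal pair every `bit_time_ns` nanoseconds. */
unsigned link_mbytes_per_s(unsigned width_bits, unsigned bit_time_ns)
{
    assert(bit_time_ns > 0);
    /* width_bits / 8 bytes every bit_time_ns ns, scaled from bytes/ns
       to Mbytes/s by the factor 1000. */
    return (width_bits * 1000u) / (8u * bit_time_ns);
}
```

With the 2 ns/bit/signal pair quoted above, a 16-bit link peaks at 1,000 Mbytes/s and an 8-bit link at 500 Mbytes/s; an 8-bit link at 4 ns per bit gives 250 Mbytes/s.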

P1596.4. The High-Bandwidth Memory Chip Interface
project defines an interface that will permit access to the
large internal bandwidth available inside dynamic memory 
chips. The goal is to increase the performance and reduce 
the complexity of memory systems by using SCI signaling 
technology and a subset of the SCI protocols. This work was 
started by Hans Wiggers of Hewlett Packard Laboratories, 
and I now chair it. (My address appears at the end of this 
article.) 

A serious problem with present memory systems is the
need to use a large number of memory chips in parallel banks
to get the bandwidth needed for today’s powerful microprocessors.
As the capacity per chip increases, the smallest
memory configuration with adequate bandwidth reaches a
point where it has an unreasonably large capacity (and needlessly
high cost). Furthermore, the increments for expansion
are too large. We hope to get much higher bandwidth from
far fewer chips by using SCI signaling technology. This approach
will lower the entry cost for low-end systems and
raise the performance of high-end systems.

Designers are considering several models. 5 One of the most 
promising uses several RAM chip ringlets attached to a single 
controller by 8-bit-wide point-to-point links. The name RAM 
Link is becoming popular for this approach. Details of the 
signaling are still being worked out, but there seems to be 
general agreement to use small signal voltages. 

Other issues include whether to perform ECC (error checking
and correction) in each RAM chip or in the controller.
Including it in the RAM chip leaves the details up to the
vendor, with possible future technology improvements being
incorporated transparently. Including it in the controller
is probably the lowest initial cost and gives the system designer
the most control. But ECC involves a performance
penalty for transmitting the extra information on the links.
We could increase link performance by using a 9-bit-wide
link, but that would cost pins and power.

Another open question concerns the link protocol to be
used. If an SCI bypass FIFO can be incorporated on the
RAM chips, we can use a simplified version of the present
SCI protocols. If not, we must define a scheduling mechanism
that prevents two packets from being transmitted at
once. Predictability issues, such as the effects of refresh or
ECC soft-error correction on the RAM chip, complicate scheduling.
However, in the November 1991 meeting the group
proposed a “designated token” scheduling mechanism that
looked very promising. Draft 0.11 was presented at the January
meeting.

P1596.5. The Shared-Data Formats Optimized for SCI
project specifies data formats for efficiently exchanging data
between byte-addressable processors on SCI. SCI supports
efficient data transfers between heterogeneous workstations
within a distributed computing environment. Current systems
require conversions among large numbers of vendor- or
language-dependent data formats; specifying a single transfer
format greatly reduces the complexity of this conversion
problem. In addition to simplifying the data-interchange problem,
standard data formats provide a framework for the design
of future processor instruction sets and language data
types. David V. James, Apple Computer, chairs this group.

The specification defines integer and floating-point sizes, 
formats, and address-alignment constraints. It supports bit 
fields as subcomponents of a larger byte-addressable integer 
datum. Work on this project has just begun, but it should not 
take long to complete, since much of the groundwork has 
been done earlier in conjunction with development of the CSR 
Architecture. 6 Draft 0.50 was presented at the January meeting. 

Future plans 

As technology advances, SCI will need to define new link 
standards. For example, present fiber optic technology is 
currently expensive at 1,000 Mbps and prohibitive at higher 
rates. Yet this situation changes rapidly, and SCI applications 
in high-definition television would be greatly helped by a bit 
rate at least double this. We will monitor progress in this 
area. 

Similarly, SCI offers enormous application possibilities for
slower links. An 8-bit-wide link operating at 250 Mbytes/s or
500 Mbytes/s might be about right for the next-generation
personal computers. Perhaps it will be appropriate soon to
define a personal computer form factor and signal standard
for SCI, based on new CMOS chips or processors with integrated
SCI interfaces.

While designing SCI, we learned a lot about how the processor
should interact with the interconnect. We are considering
how best to spread this information to processor
designers, to make multiprocessor systems more efficient.
Possibly, this could be a recommended practice, or even a
standard, clean, 64-bit RISC architecture optimized for use
with SCI.

We have in mind two projects for bridges to Futurebus+
that are currently waiting for the right time to start. One is a
very simple bridge to Profile B, an I/O bus with no cache
coherence. The other is a general symmetric bridge that includes
cache coherence.

The base SCI standard covering the physical signaling,
logical protocols, and cache coherence mechanism
should be approved by the IEEE Standards Board in March
1992. The first commercial implementer (Dolphin SCI Technology,
Oslo, Norway) expects to have working prototypes
within a few months of the standard’s approval and has promised
to make the interface chips available to others.

SCI’s performance seems such a large step ahead of the
current state of the art in computer buses that it has some 
difficulty in appearing credible. The best answer to these 
doubts will be the existence of working silicon, available at 
reasonable prices, being used in working systems. 

The complexity of SCI is approximately the same as that of 
a split-cycle bus system (like the VAX BI or Futurebus+ or 
Fastbus with Buffered Interconnects) in small applications. It 
is much less than that of a bus system for large applications. 
Nevertheless, relatively few designers have experience with 
split-cycle bus design issues, and therefore we foresee some 
need for training as they move to SCI. 

SCI has no evident competition in terms of open systems
that could hope to deal with the massive computation needs
for the next generation of data acquisition, analysis, and
general computation. Nor is there any competitor that spans
SCI’s whole application range: campuswide optical LAN, desktop
workstation bus, network shared server, I/O interface,
data acquisition system, highly parallel multiprocessor,
supercomputer.

For details, or to participate in this work, please contact
me.

Unix and the Am29000 Microprocessor


Though targeted for use in medium- to high-performance embedded applications, the Am29000 
includes several design provisions allowing its use in Unix workstation applications. For 
example, a scalable interrupt-handling mechanism makes the processor particularly suited 
to real-time Unix operations. The relevant features discussed here will help users interested 
in building a Unix system. 


Daniel Mann 

Advanced Micro Devices 



Use of RISC processors in medium- and
high-performance embedded applications
is growing rapidly. As prices fall,
it seems likely that RISC machines will
dominate CISC processors as the system of choice
for an even wider range of new embedded system
designs. A number of companies are already
manufacturing processors using RISC principles
that offer a better price-performance ratio than
the top-end CISC devices.

Many designers have chosen to use Advanced
Micro Devices’ Am29000 processor in complicated
embedded applications. The increased use of
higher level languages and real-time operating
system support services with embedded RISC
applications requires engineers to know more
about processor features previously more widely
used in workstation applications, such as Unix.

Unix was developed on small computers, and
it makes modest demands on a host processor.
However, for good performance Unix requires
the processor to provide certain basic facilities.
The Am29000 has several features that make it a
particularly suitable Unix host. For example, the
architecture is scalable yet has features for maintaining
code compatibility.

Our processor’s interrupt-handling mechanism
services devices without incurring a built-in
exception-processing sequence. This is of particular
interest to implementers of Unix systems that
are also constrained by real-time events. Users
are free to design the necessary interrupt-handling
processor environment for maximum efficiency.

Among the Am29000’s features of particular
interest to designers of high-performance Unix
systems are

• data and instruction caching, 

• memory management, 

• multiple data transfer instructions, 

• freeze-mode operation, 

• multiprocessor support, 

• floating-point support, and 

• flexible memory systems. 

A detailed discussion of the Am29000’s architecture 1,2
is beyond the scope of this article, as is
a discussion of Unix internals. 3,4 Rather, I point
out features that make our processor a good Unix
host.

C calling sequence 

Making a subroutine call on a processor with 
general-purpose registers is expensive in terms 
of time and resources. Because functions must 
compete for register use, registers must be saved 
and restored through register-to-memory and 
memory-to-register operations. For example, a C 
function call on Motorola’s MC68000 processor 
(see Table 1 on page 24) might use the statements 

char bits8;
short bits16;
printf("char= %c short=%d", bits8, bits16);

After they are compiled, they generate the following
assembly-level code:


0272-1732/92/0200-0023$03.00 © 1992 IEEE

To reduce future access delays, the
system normally copies data to general-purpose
registers before using it.
For instance, using a memory-to-memory
operation when moving data
from the local frame of the function
call stack would reduce the number of
instructions executed. However, these
are CISC instructions that require several
machine cycles before completion.

In the example, the C function call
passes two variables, bits8 and bits16,
to the library function printf( ). The following
assembly code shows part of
the printf( ) function for the MC68000:

_printf:
        LINK    A6, #-32         ;local variable space
        LEA     8[A6], A0        ;unstack string pointer
        UNLK    A6
        RTS

Several multicycle instructions are required
to pass the parameters and establish
the function context. Unlike the
variable instruction format in the
MC68000, the Am29000 has a fixed 32-bit
instruction format (see Figure 1).
The same C statements compiled for
the Am29000 (see Table 2) generate
the following assembly code for passing
the parameters and establishing the
function context:


L1:     .ascii  "bits8=%c bits16=%d"
        const   lr2, L1
        consth  lr2, L1
        add     lr3, lr6, 0      ;move bits8 and bits16
        add     lr4, lr8, 0      ;to bottom of the
                                 ;activation record
        call    lr0, _printf     ;return addr in lr0

For comparison, the MC68000 code generated for the printf( ) call is:

        MOVE.W  -4[A6], D0       ;stack bits16 variable
        EXT.L   D0
        MOVE.L  D0, -[A7]
        MOVE.B  -1[A6], D0       ;stack bits8 variable
        EXTB.L  D0
        MOVE.L  D0, -[A7]
        PEA     L15              ;stack text string ptr
        JSR     _printf
        LEA     12[A7], A7       ;repair stack pointer


Table 1. MC68000 instructions.

Instruction             Comment

MOVE.W saddr, daddr     Move 16 bits of data from saddr to daddr.

MOVE.W -4[A6], D0       Source address is register indirect with
                        displacement. Destination address is data
                        register direct. The word at memory location -4
                        relative to the current frame pointer (A6) is
                        copied into data register D0.

MOVE.B saddr, daddr     Move 8 bits of data from saddr to daddr.

MOVE.L saddr, daddr     Move 32 bits of data from saddr to daddr.

EXT.L data_register     Extend the sign of 16-bit data to 32-bit register
                        size.

PEA L15                 Push address L15 onto the stack (A7).

JSR _printf             A jump to subroutine _printf is taken. The current
                        PC value is first pushed onto the stack (A7).

LEA 8[A6], A0           The address of the data object located 8 bytes
                        above the current frame pointer (A6) is loaded
                        into address register A0.

LINK A6, #-32           Push the frame pointer (A6) onto the stack. Then
                        copy the stack pointer (A7) to the frame
                        pointer (A6). The stack pointer (A7) is then
                        lowered by 32 bytes. This instruction takes
                        several cycles and is used in a procedure
                        prologue.

UNLK A6                 The frame pointer (A6) is copied to the stack
                        pointer. A new frame pointer is then popped
                        out of the stack. This instruction takes several
                        cycles and is used in a procedure epilogue.

RTS                     The address value for a subroutine return is
                        popped out of the stack (A7) and loaded into
                        the PC.


L15:    .ascii  "char= %c short=%d"


This assembly listing shows how parameters pass via the 
stack to the function being called. 

The LINK instruction copies the stack pointer A7 to the
local frame pointer A6 upon entry to a routine. The parameters
passed and local variables in memory are referenced
relative to register A6.


Op code     Operand C     Operand A     Operand B
8 bits      8 bits        8 bits        8 bits

Figure 1. Am29000 fixed 32-bit instruction format.
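
Because every field has the same width, splitting an instruction word takes only shifts. The following C sketch shows this, assuming the fields in Figure 1 are listed from the most significant byte to the least significant byte; the structure and function names are hypothetical.

```c
#include <stdint.h>

/* The four 8-bit fields of Figure 1. */
struct am29k_fields {
    uint8_t opcode;
    uint8_t c;
    uint8_t a;
    uint8_t b;
};

/* Split a 32-bit instruction word into its four fields, assuming the op
   code occupies the most significant byte and operand B the least. */
struct am29k_fields decode_fields(uint32_t word)
{
    struct am29k_fields f;

    f.opcode = (uint8_t)(word >> 24);
    f.c      = (uint8_t)(word >> 16);
    f.a      = (uint8_t)(word >> 8);
    f.b      = (uint8_t)word;
    return f;
}
```

A variable-length format would need to examine the op code before it could even locate the operands, which is part of the decoding cost a fixed format avoids.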
Table 2. Am29000 instructions.

Instruction               Comment

const reg, value          The value is placed in the selected register. This
                          is a data-direct instruction. Operand fields A and
                          B hold the 16-bit constant.

const lr2, L1             The lower 16 bits of address L1 are placed in local
                          register lr2. The high 16 bits of lr2 are cleared.

consth lr2, L1            The high 16 bits of address L1 are placed in the
                          high 16 bits of local register lr2.

add des, srcA, srcB       The two source operands A and B are added, and the
                          result placed in the destination register.

add lr3, lr6, 0           Zero is added to the lr6 value, and the result is
                          placed in local register lr3. This is effectively a
                          register-to-register move operation.

call lr0, _printf         A jump to subroutine _printf occurs in the cycle
                          following the current cycle. The current PC address
                          plus 8 is placed in local register lr0. This is
                          effectively a subroutine call with the destination
                          address obtained by adding the current PC value
                          and the 16-bit value formed by operand fields A
                          and B. The Am29000 implements delay-slot
                          branching; thus it always executes the instruction
                          following a branch instruction. Direct 32-bit
                          address procedure calls are supported with the
                          CALLI instruction.

jmpi lr0                  The value in local register lr0 is loaded in the PC.
                          Execution at the destination of a jump commences
                          after the instruction following the JMPI has
                          executed. This instruction implements a subroutine
                          return.

asgeu V_SPILL, gr1, rab   This is a conditional trap instruction. Such
                          instructions are used in procedure prologues and
                          epilogues to check whether cache spilling or
                          filling is required. Trap number V_SPILL is taken
                          if the assertion that the global register gr1 is
                          greater than or equal to global register rab is
                          false.
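
The const and consth pair builds a full 32-bit address in two steps, as Table 2 describes. The C sketch below mimics that behavior on an ordinary variable; the function names are hypothetical and the code models only the register result, not the instruction encoding.

```c
#include <stdint.h>

/* const: place a 16-bit value in the low half and clear the high half. */
uint32_t op_const(uint16_t low)
{
    return (uint32_t)low;
}

/* consth: replace the high half and keep the low half unchanged. */
uint32_t op_consth(uint32_t reg, uint16_t high)
{
    return ((uint32_t)high << 16) | (reg & 0xFFFFu);
}

/* Load a 32-bit address the way the const/consth pair loads L1 into lr2. */
uint32_t load_address(uint32_t addr)
{
    uint32_t lr2 = op_const((uint16_t)(addr & 0xFFFFu));

    lr2 = op_consth(lr2, (uint16_t)(addr >> 16));
    return lr2;
}
```

Two single-word instructions are needed because a 32-bit constant cannot fit inside a 32-bit instruction that also carries an op code.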


Register stack. We define a register
stack in an assigned area of memory to
pass the parameters and allocate working
registers to each procedure. The register
cache replaces the top part of the register
stack, as shown in Figure 2 on page 26.

The global registers rab and rfb point
to the top and bottom of the register
cache. Global register rsp (also known
as gr1) points to the top of the register
stack. The register cache, or stack window,
moves up and down the register
stack as the stack grows and shrinks. Use
of the register cache allows data to be
accessed through local registers at high
speed. On-chip triple-porting (two read
ports and one write port) enables the
register stack to perform better than a
data memory cache, which cannot read
and write in the same cycle.

Activation record. Our processor
does not apply push or pop instructions
to external memory. Instead, each function
is allocated an activation record in
the register cache at compile time. Activation
records hold local variables and
parameters passed to the function.

The caller stores its outgoing arguments
at the bottom of the activation
record. The called function establishes a
new activation record below the caller’s
record. The top of the new record overlaps
the bottom of the old record, so the
outgoing parameters of the calling function
are visible within the called
function’s activation record.

Although the activation record can be
any size within the limits of the physical
cache, the compiler will not allocate
more than 16 registers to the parameter-passing
part of the activation record. Functions that cannot
pass all their outgoing parameters in registers must use a
memory stack for additional parameters (global register msp
points to the top of the memory stack). This happens infrequently,
but it is required for the parameters that have their
address taken. Data parameters at known addresses cannot
be supported in register address space because data addresses
always refer to memory and not to registers.

The following code shows part of the printf( ) function for
the Am29000 processor:

_printf:
        sub     gr1, gr1, 16         ;function prologue
        asgeu   V_SPILL, gr1, rab    ;compare with top of window
        add     lr1, gr1, 36         ;rab is gr126
        jmpi    lr0                  ;return
        asleu   V_FILL, lr1, rfb     ;compare with bottom of
                                     ;window gr127



The register stack pointer rsp points to the bottom of the
current function’s activation record. All local registers are referenced
relative to rsp. Four new registers are required to support
the function call shown, so rsp is decremented 16 bytes.
[Figure 2 diagram: the register stack in external memory, growing down
from higher to lower addresses. The cache-resident portion of the stack
is held in the register cache (the stack window, which moves up and
down); the rest is the memory-resident portion. rfb points to the bottom
of the cache register window*, rab points to the top of the cache
register window, and rsp points to the top of the stack.]

* Since the stack grows down, the “bottom” (older) activation records
are located above the “top” (newer) activation records.

Figure 2. Register stack window.


Rsp performs a role similar to the MC68000’s A7 and A6 
registers except that it points to data in high-speed registers, 
not data in external memory. 

The compiler reserves local registers lr1 and lr0 for special
duties within each activation record. Lr0 contains the address
at which execution resumes on return to the caller’s activation
record. Lr1 points to the top of the caller’s activation record.
The new frame allocates local registers lr2 and lr3 to hold
printf function local variables.

As Figure 3 shows, the positions of five registers overlap.
The three printf( ) parameters enter from lr2, lr3, and lr4 of
the caller’s activation record and appear as lr6, lr7, and lr8 of
the printf( ) function activation record.

Register   Contents             Notes
lr8        Incoming parameter   Top of activation record
lr7        Incoming parameter
lr6        Incoming parameter
lr5        Frame pointer
lr4        Return address       Base of caller’s activation record
                                (gr1 before _printf is called)
lr3        Local
lr2        Local
lr1        Frame pointer
lr0                             Base of _printf activation record
                                (gr1 (rsp) when _printf executes*)

The printf activation record is 9 words.

* gr1 is lowered 4 words (16 bytes) in the prologue of _printf.

Figure 3. Overlapping registers.

Spill and fill. If not enough registers are available in the
cache when it moves down the register stack, then a V_SPILL
trap is taken, and the registers spill out of cache into memory.
Only procedure calls that require more registers than currently
are available in the cache suffer this overhead.

Once a spill occurs, a fill (V_FILL trap) can be expected at
a later time. The fill doesn’t happen when the function call
causing the spill returns, but rather when some earlier function
that requires data held in a previous activation record
(just below the cache window) returns. Just before a function
returns, the lr1 register, which points to the top of the caller’s
activation record, is compared with the pointer to the bottom
of the cache window (rfb). If the activation
record is not stored completely in the
cache, then the fill overhead occurs.

Performance. The register stack improves
the performance of call operations
because most calls proceed without
memory access. The register cache contains
128 registers, so very few function
calls or returns require register spilling or
filling.

Because most of the data required by a
function reside in local registers, there is
no need for elaborate memory-addressing
modes, which increase access latency.
The function call overhead in the Am29000
consists of a small number of single-cycle
instructions; the MC68000 requires a
greater number of multicycle instructions.

Context switching

Context switching occurs when one
process gives up control of the CPU without
terminating, and another process,
which had previously given up control,
resumes executing. When this happens,
the state of the processor being used by
the process (the context) must be saved,
and the context of the other process must
be restored. Other than allowing multiple processes to share
the same CPU, this saving and restoring of context performs
no useful work. If context switching is performed frequently,
it should be a low-overhead operation.

One result of having a large register file is an increased
context-switching time. Approximately half of the 128 local
registers and about 32 of the global registers contain data for
the user’s process. About 96 memory writes and 44 memory
reads are required to store and reload these registers with a
new context. However, with single-cycle, burst-mode access,
the register save and restore time can be as short as 5.6 µs.
Note that the context reloading for synchronously saved process
states is faster than context saving because only the activation
record of the currently active procedure need be
restored. The number of local registers requiring restoring decreases
from approximately 64 to, typically, 12.

Memory access time dominates the total context switch time.
In a low-cost system with no external data cache, this is most
apparent. When using five-cycle read/write memory, the
memory access time is 28 µs. In practice, any increase in
context-switching time due to the large register set is more than
offset by the savings in normal execution. This is because
context switches (30 to 60/s) are less frequent than system
calls (150 to 250/s) or subroutine calls (about 350,000/s). The
Am29000 can switch context in as little as 10 µs, about 45
times faster than the VAX 11/780.
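
The 5.6-µs and 28-µs figures follow from the access counts. The C sketch below reproduces the arithmetic, assuming a 40-ns processor cycle, which is the cycle time implied by the figure of nine cycles in 0.36 µs quoted under Interrupt handling; the function name is hypothetical.

```c
#include <assert.h>

/* Time in nanoseconds to perform `writes` + `reads` memory accesses when
   each access takes `cycles_per_access` cycles of `cycle_ns` nanoseconds. */
unsigned long transfer_ns(unsigned writes, unsigned reads,
                          unsigned cycles_per_access, unsigned cycle_ns)
{
    assert(cycles_per_access > 0 && cycle_ns > 0);
    return (unsigned long)(writes + reads) * cycles_per_access * cycle_ns;
}
```

With 96 writes and 44 reads, single-cycle access gives 140 x 40 ns = 5,600 ns, and five-cycle memory gives 140 x 5 x 40 ns = 28,000 ns.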

System calls 

System calls are trusted library functions that execute with
kernel mode permissions to access resources otherwise unavailable
to the user. To users, a system call looks like a normal
C function call. But when a user program executes a
system call, it leaves user mode and enters kernel mode.

Permission levels. When a system call is made, the selected
function executes a trap instruction to change the processor
status. After the trap is taken, the process operates with
kernel mode permissions. It is usually necessary for the operating
system to copy the calling parameters from the user mode
register stack to the kernel data space, because system call
functions receive their parameters from kernel data memory
space. To enable system functions to appear as ordinary C
functions, the register stack parameters are transferred into a
kernel data structure known as the upage by a system call
dispatcher routine.

System calls running with kernel mode permissions must not
access data that the user would not normally be able to access.
When moving data between user space and kernel space,
many processors use assembly level instructions to manipulate
the memory management unit (MMU) or other protection
hardware. This allows the kernel to access the user data space
with the user’s permissions rather than the kernel’s permissions.

The usual set of Unix functions available includes a fetch
user byte, fubyte(a), and a store user byte, subyte(a,v). As an
example, the statement

result = fubyte(addr);

fetches a byte of data from user address addr and stores it in
the variable result. This variable is in kernel data space, and
the access of the user data space occurs with the user’s permissions.
The subyte(a,v) function moves data in the opposite
direction, from kernel space to user space. This routine is required
to modify data structures in the user’s address space.
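
The behavior of these two routines can be imitated with a toy model in C. This is not kernel code: the user_space structure, the per-byte permission flags, and the sim_ prefix are all assumptions for illustration, and a real kernel relies on the MMU rather than an explicit table. The convention of returning -1 on a failed access follows traditional Unix practice.

```c
#include <assert.h>
#include <stddef.h>

#define USER_SPACE_SIZE 256u
#define PERM_READ  1u
#define PERM_WRITE 2u

/* Toy user address space with one permission byte per data byte. */
struct user_space {
    unsigned char mem[USER_SPACE_SIZE];
    unsigned char perm[USER_SPACE_SIZE];
};

/* Fetch a user byte with the user's permissions; -1 if not readable. */
int sim_fubyte(const struct user_space *u, size_t addr)
{
    assert(u != NULL);
    if (addr >= USER_SPACE_SIZE || !(u->perm[addr] & PERM_READ))
        return -1;
    return u->mem[addr];
}

/* Store a user byte with the user's permissions; -1 if not writable. */
int sim_subyte(struct user_space *u, size_t addr, unsigned char value)
{
    assert(u != NULL);
    if (addr >= USER_SPACE_SIZE || !(u->perm[addr] & PERM_WRITE))
        return -1;
    u->mem[addr] = value;
    return 0;
}
```

The point of the model is that the check uses the user's rights, not the kernel's, even though the code doing the access runs in kernel mode.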

Protection boundaries. Our processor efficiently copies
data across protection boundaries in a Unix implementation.
System call parameters are not located in the user’s data
memory space but in the register cache, where they don’t
result in a heavy overhead.

After the system call trap is taken, call parameters can be
copied from the registers into the upage without any permission
problems because the register cache is wholly owned by
the user process. There is no risk of the kernel accessing data
belonging to another user. The store multiple instruction
moves data quickly from registers to memory.

The processor handles system call return values the same
way as any other function call return values. Global registers
from gr96 through gr111 hold the first 16 words of a function
return value. Any other data in the user’s data space that the
system call uses or updates can be accessed by setting the user
access (UA) bit in load and store (LOAD, STORE) instructions.
The UA bit allows programs executing in kernel mode to emulate
user mode access. Doing so enables functions such as
subyte( ) to be implemented inexpensively, particularly since
the UA bit can be used with the load and store multiple
(LOADM, STOREM) instructions.

Overhead. Because few accesses to data memory are re¬ 
quired, the overhead of switching from user mode to kernel 
mode is small. In kernel mode, separate memory and register 
stacks are used, but they are still accessed via global registers 
rab, rfb, msp, and rsp. Upon entering kernel mode, these reg¬ 
isters contain user mode values which are then stored in the 
upage (part of the context save) and replaced with addresses 
of kernel-mode memory and register stacks. The register stack 
is not pushed to memory, but rather the kernel register stack is 
added to the end of the cache. 

If the barriers between user and kernel modes are high and 
secure, the computation required for a mode change may far 
exceed the cost of the actual requested operation. In cases of 
extremely poor design, the simplest system calls may require 
hundreds or even thousands of overhead instructions. 

In keeping with RISC concepts, we structured the system 
call operation to ensure the minimum overheads at each stage. 
As an example of the results achieved, getpid( ) executes in 
about 10 µs, depending on the memory architecture. The same 
function call has been measured at 400 µs on a VAX 11/780. 
The system call mechanism results in an overhead that is one 
hundredth of that incurred by a VAX 11/780. 

Interrupt handling 

As a RISC processor, the Am29000 limits the mechanism for 
handling interrupts and traps so system designers need not be 
concerned about when, where, or how much state is saved. 
State saving is not built into the processor interrupt mecha¬ 
nism but is left for the programmer to implement. Interrupts 
use a vector table to select the required service routine. After 
nine cycles (0.36 µs) or less, the processor executes the service 
code. A freeze mode makes this possible. 

Freeze mode. When an interrupt or trap occurs, the pro¬ 
cessor enters freeze mode. The hardware execution context 
described by critical CPU registers is not saved, but the state of 
these registers is frozen and cannot be updated until an inter¬ 
rupt return is taken or freeze mode is turned off. Interrupts 
that do not require the use of processor special registers can 
be serviced in freeze mode. This allows a fast interrupt service 
response. In fact, it is practical to implement a software cache- 
controlled reload in low-cost systems. 

Certain critical registers are required for the execution of C 
code routines. If the interrupt-handling code is just assembly 
glue used to reach a C code routine, then the critical registers 
must be saved explicitly before freeze mode is turned off. 
Users can tailor the form of the register context stacking pro¬ 
cedure to meet special needs. This facility can prove useful in 
optimizing system operation. 

Interrupt servicing. Our processor’s high-speed opera¬ 
tion enables interrupts to be handled by a Unix implementa¬ 
tion without assigning a priority order. When an interrupt is 
detected, it will be serviced immediately if no other interrupts 
are currently being serviced. Otherwise, the interrupt is added 
to the end of a queue. The processor continually tries to empty 
this queue and return to executing user processes. 
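
The queueing behavior described above can be sketched as a small first-in, first-out buffer. The structure name, the fixed capacity of 16, and the policy of dropping interrupts when the queue is full are assumptions chosen for this example, not part of the Unix implementation discussed here.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_CAPACITY 16          /* assumed depth, chosen for the example */

struct irq_queue {
    int pending[QUEUE_CAPACITY];   /* interrupt numbers awaiting service */
    int head;                      /* index of the oldest pending entry */
    int count;                     /* number of pending entries */
    bool servicing;                /* true while a handler is running */
};

/* Called when an interrupt is detected. Returns true if it may be
   serviced immediately, false if it was appended to the queue
   (or dropped because the queue is full). */
bool irq_arrive(struct irq_queue *q, int irq)
{
    if (!q->servicing) {
        q->servicing = true;
        return true;
    }
    if (q->count < QUEUE_CAPACITY) {
        q->pending[(q->head + q->count) % QUEUE_CAPACITY] = irq;
        q->count++;
    }
    return false;
}

/* Called when a handler finishes. Returns the next queued interrupt to
   service, or -1 if the queue is empty and user code can resume. */
int irq_done(struct irq_queue *q)
{
    if (q->count == 0) {
        q->servicing = false;
        return -1;
    }
    int next = q->pending[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return next;
}
```

Interrupts are serviced strictly in arrival order, which matches the absence of a priority scheme in the text.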

When the first interrupt occurs, possibly initiating a queue, 
the processor may be in user mode or kernel mode. Although 
each process executes the same kernel code, each process has 
its own kernel stack for kernel mode function calls and data. 
In user mode, the kernel memory stack stores data during the 
interrupt service routines. If the interrupt occurs in kernel 
mode, then a separate shared interrupt stack is used. 

Although I have described only one stack, the processor 
contains two: one for conventional memory data and one for 
register data that has spilled out or swapped out of the register 
cache. 

Virtual address space 

Unix is a multiprocess system, with several processes ex¬ 
isting in different areas of system memory at the same time. 
Programs may be loaded at some memory location and then 
change location during execution due to being swapped out 
and back in again. Programs are usually intended to execute 
from virtual address zero, but the actual physical memory 
used by the program depends on the relocation function 
of the MMU. The program, or parts of the program, may get 
swapped to new physical locations. But because of the address 
translation of the MMU, they always appear to be at the 
same virtual address. 

On-chip MMU. We built a standard MMU into the Am29000 
chip to lower system cost and maintain compatibility among 
29000 users. Most other processors require expensive exter¬ 
nal hardware to translate addresses. In such cases, designers 
sometimes prefer to develop their own MMUs. This can lead 
to incompatibilities with code developed for other designs. 

Translations. The Am29000 uses a 64-entry translation 
look-aside buffer (TLB) to perform address translations. The 
TLB translates the most frequently used instruction and data 
memory pages and reflects the information contained in more 
extensive tables in memory. Data addresses are translated 
during each execute processor stage. Instruction addresses 
are translated whenever a jump is taken, achieving virtual 
memory support without any access delays. 

Because of the TLB’s limited size, there is a chance that an 
access will be requested for a memory page not currently 
translated—a TLB miss. When this occurs, the special register 
lru (least-recently used TLB entry) selects a TLB register for 
replacement with new translation data. 

The TLB registers are not updated automatically by hard¬ 
ware. When a TLB miss occurs, a trap to a software routine 
that maintains TLB entries is performed. Using this method to 
update the TLB allows a variety of memory management 
techniques to be implemented. Systems that rely on hard¬ 
ware to update the TLB registers do not achieve this level of 
flexibility. The technique is particularly well suited to the 
emulation of missing hardware accessed in virtual address 
space. A TLB register can be updated in two Am29000 cycles 
(one for each word of a TLB entry). This keeps down the 
software overhead of reloading TLB. Depending on page- 
table layout, a TLB reload needs 25 to 40 cycles after a page 
miss trap occurs. 
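
A software-managed TLB of this kind can be sketched in C as follows. This is a simplified model, not the Am29000 trap handler: the flat page_table array, the 4-Kbyte page size, the fully associative search, and the timestamp used to pick the least recently used entry are all assumptions made for illustration. Only the 64-entry size and the two-word entry format come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12             /* assumed 4-Kbyte pages */
#define NUM_PAGES   1024           /* assumed size of the flat page table */
#define VALID       0x80000000u

struct tlb_entry {
    uint32_t word0;                /* virtual page number, bit 31 = valid */
    uint32_t word1;                /* physical frame number */
    uint32_t last_used;            /* software timestamp, models the lru choice */
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint32_t page_table[NUM_PAGES];    /* virtual page -> physical frame */
static uint32_t now;                      /* access counter */
static unsigned miss_count;               /* number of simulated miss traps */

/* Simulated TLB miss trap handler: pick the least recently used entry
   and write its two words from the in-memory page table. */
static struct tlb_entry *tlb_miss(uint32_t vpn)
{
    int lru = 0;
    for (int i = 1; i < TLB_ENTRIES; i++)
        if (tlb[i].last_used < tlb[lru].last_used)
            lru = i;
    tlb[lru].word0 = VALID | vpn;
    tlb[lru].word1 = page_table[vpn];
    miss_count++;
    return &tlb[lru];
}

/* Translate a virtual address, trapping to tlb_miss() when needed. */
uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    assert(vpn < NUM_PAGES);
    struct tlb_entry *e = 0;
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].word0 == (VALID | vpn))
            e = &tlb[i];
    if (!e)
        e = tlb_miss(vpn);
    e->last_used = ++now;
    return (e->word1 << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
}
```

Because the reload is ordinary software, the page-table layout and replacement policy in tlb_miss( ) can be changed freely, which is the flexibility the text attributes to this approach.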

Memory access protection 

In any multiprogramming system, especially one in which 
users are developing, debugging, and testing software, pro¬ 
cess a should be forbidden access to the data of process b if 
accessing that data would impair process b. In addition, some 
operating system data should be protected even from read¬ 
only access. 

Reducing memory usage and permitting efficient 
multiprocess cooperation requires controlled sharing of re¬ 
gions of memory and other resources. For many programs, a 
large portion of executable code consists of the system-pro¬ 
vided library subroutines. For small programs, the library code 
used by the program can be larger than the program itself, 
particularly if the program makes heavy use of standard I/O 
routines. Enhanced Unix systems provide shared libraries so 
only one copy of the library code is required for all the processes 
in the system. 

Protection violation checking. The TLB hardware checks 
for protection violation. Each TLB entry contains a 6-bit field 
that controls access to the associated page, which is assigned 
to a particular user through a task identifier. Permissions can 
be set to separately enable reading, writing, and execution of 
page data. The processor mode and user ID are checked 
with the access permissions for the virtual memory page ac¬ 
cessed. A software trap occurs if access is not permitted. In¬ 
cluding user identifiers in each TLB entry enables Unix to 
switch processes without clearing all TLB entries. Performance 
is improved because valid TLB entries are likely to remain 
and be reused when their associated process is swapped 
back in. 
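
The check described above can be sketched as follows. The split of the 6-bit field into three kernel-mode bits and three user-mode bits, the bit positions, and all identifiers are assumptions made for this example. For simplicity, a task identifier mismatch is treated here as a denied access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the 6-bit permission field: three bits for kernel
   (supervisor) mode and three for user mode. Bit positions are
   illustrative only. */
enum {
    PERM_UE = 1 << 0,  /* user execute */
    PERM_UW = 1 << 1,  /* user write */
    PERM_UR = 1 << 2,  /* user read */
    PERM_SE = 1 << 3,  /* supervisor execute */
    PERM_SW = 1 << 4,  /* supervisor write */
    PERM_SR = 1 << 5   /* supervisor read */
};

enum access { ACCESS_EXECUTE = 0, ACCESS_WRITE = 1, ACCESS_READ = 2 };

struct tlb_perm {
    uint8_t task_id;   /* process that owns this translation */
    uint8_t perms;     /* 6-bit permission field */
};

/* Returns true if the access is allowed. A false result stands for the
   software trap described in the text. */
bool access_permitted(const struct tlb_perm *e, uint8_t current_task,
                      bool kernel_mode, enum access kind)
{
    if (e->task_id != current_task)
        return false;                    /* entry belongs to another process */
    unsigned shift = kernel_mode ? 3u : 0u;
    return (e->perms >> (shift + (unsigned)kind)) & 1u;
}
```

Since each entry carries its own task identifier, entries for a suspended process can stay in place across a context switch without becoming usable by the process that runs next.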

Protection control. The Am29000 offers considerable 
access control within the chip. Other systems that use exter¬ 
nal hardware to achieve a similar function incur a greater 
system cost. Also, by including the necessary hardware in 
our design, we achieve a standard that cannot be ensured 
when designers build their own protection systems. 

Cache support 

The Am29000 incorporates an on-chip branch target cache 
(BTC) memory. If the target instructions of a branch are found 
in the cache, the branch executes in one cycle, causing no 
stalling of the processor pipeline. For sequential instruction 
access, an instruction prefetch buffer ensures that instruc¬ 
tions are fetched up to four cycles ahead of their execution, 
enabling single-cycle instruction execution at high speeds. 

The BTC memory contains the target instruction sequence 
for 32 branches. Simulation results show this is sufficient for 
60 percent of the branch instructions. If a BTC memory miss 
occurs, the hardware automatically updates the cache with 
the new target instruction sequence. In such cases, the ex¬ 
ecution pipeline stalls for one cycle plus the number of cycles 
required to access the instruction memory. Having an on- 
chip instruction cache enables users to implement low-cost 
systems that achieve high speeds. 

Data caching. The Am29000 processor caches data via a 
large register file to reduce the number of memory loads and 
stores. This reduces the requirement for an external data cache. 
Of the 192 general-purpose registers, 128 serve as a register 
cache. (See Figure 4.) This cache stores data variables and 
function call parameters by using an overlapping stack frame 
technique. 

On occasion the cache becomes full and some data spills 
out to memory. Simulation results of programs such as nroff 
show this is necessary in only 0.1 percent to 0.5 percent of all 
calls. Since calls typically constitute 1.2 percent of all instruc¬ 
tions used, and spilling out and filling back typically require 
only 25 to 30 cycles, the overhead is low. 
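
To make the size of this overhead concrete, the figures quoted above can be combined into a rough per-instruction cost. This is a back-of-the-envelope estimate that assumes the quoted percentages are independent averages; it is not a measurement from the simulation study.

```latex
\begin{align*}
\text{upper bound: } & 0.012 \times 0.005 \times 30 = 0.0018 \text{ cycles per instruction} \\
\text{lower bound: } & 0.012 \times 0.001 \times 25 = 0.0003 \text{ cycles per instruction}
\end{align*}
```

Under these assumptions, spilling and filling add well under one hundredth of a cycle to the average instruction.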

Accessing data in the register file does not suffer the delays 
encountered in accessing external data, even if a high-speed 
cache is used. Thus the Am29000 processor is well suited for 
even higher speed devices in future technologies. The register 
file has a further advantage in that it is triple-ported. It 
accesses two sources in one cycle, while a previously computed 
result is written back. Subsequent accesses take one 
cycle, using burst mode. 

On-chip instruction caching. BTC memory caches user 
mode and kernel mode instructions. The cache tag logic works 
with physical addresses, possibly after they have gone through 
the virtual-to-physical address translation service of the TLB. 
The BTC memory entries need not be invalidated on a con¬ 
text switch. Keeping entries valid for future reuse improves 
performance. However, when the kernel decides to swap 
user process pages in or out of disk memory, instructions at 
cached physical memory locations may change, invalidating 
the cache. 

The newer Am29030 processor contains an 8-Kbyte instruction 
cache. The addition of the cache was made possible 
by higher levels of integration that were not available when 
the Am29000 was introduced. Research shows that for cache 
sizes above 4 Kbytes, a conventional cache provides superior 
performance to the BTC memory method. 5 

[Figure 4. Am29000 register stack. Labels: "Used to implement global support"; "Registers such as rab, rfb, and msp"; "Not implemented"; "Also known as rsp; points to the top of the cache." Note: Registers in the cache are normally referred to as local registers and are accessed relative to gr1. Gr1 points to local register lr0. Local register lr1 is located at register address gr1 + 4. All registers are 32 bits wide.] 

Because the BTC memory reduces initial instruction access 
latency rather than improving memory access bandwidth, the 
Am29000 uses a separate bus to support concurrent instruction 
fetching. The Am29030 does not require separate buses 
for instruction and data memory because the on-chip cache 
offers a statistical improvement to the memory bandwidth. It 
maintains sufficient instruction memory bandwidth without 
the support of a dedicated bus. 

Multiprocessor Unix 

Our processor supports on-chip multiprocessor systems 
without the need for large amounts of external hardware. It 
also supports the mechanisms to ensure processor resource 
sharing and signaling. 

An external coprocessor interface extends its execution unit 
with special-purpose instructions. If the coprocessor hard¬ 
ware is not present, then a coprocessor exception trap is 
taken. The processor emulates the coprocessor operation for 
hardware development and code compatibility. 

Multiprocessor cache support. Each TLB entry contains 
2 bits to control external cache operation. Cached pages can 
be marked selectively as not-to-be-cached, local, or shared. 
This enables the operating system to inform the cache of the 
correct action to take, and it results in reduced data traffic on 
the shared-memory bus giving access to the cache. 

A load and lock (LOADL) instruction for device and memory 
interlocks activates the LOCK output pin during the address 
cycle of the access. Multiprocessor hardware uses the lock 
signal to delay access of other processors requesting the 
same memory or I/O facility. 

The load and set (LOADSET) instruction implements a binary 
semaphore. The operation of writing True to a memory semaphore 
location, after reading the location into a 
general-purpose register, cannot be interrupted. 
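
The read-then-set behavior of LOADSET is the classic test-and-set primitive. As an analogy only, the sketch below builds a binary semaphore on the C11 atomic_flag type, whose atomic_flag_test_and_set( ) call has the same indivisible read-and-write property. The type and function names binary_sem, sem_try_acquire, and sem_release are hypothetical; the code does not show how an Am29000 compiler would emit LOADSET.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* A binary semaphore: clear means free, set means taken. */
typedef struct { atomic_flag taken; } binary_sem;

/* Try to acquire the semaphore. atomic_flag_test_and_set() reads the old
   value and writes True in one indivisible step, which is the property
   the text attributes to LOADSET. Returns true if we got the semaphore. */
bool sem_try_acquire(binary_sem *s)
{
    return !atomic_flag_test_and_set(&s->taken);
}

/* Release the semaphore so another processor can acquire it. */
void sem_release(binary_sem *s)
{
    atomic_flag_clear(&s->taken);
}
```

A caller that fails to acquire the semaphore would typically retry or sleep; that policy is left out of the sketch.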

Floating-point support 

Unix is a product of the academic and scientific environ¬ 
ment, which means that a number of its user application 
programs require floating-point support. The newer Am29050 
processor supports floating-point instructions directly. However, 
the Am29000 processor instruction set contains a number 
of floating-point instructions that are 
not directly supported. 

When a floating-point instruction is 
encountered, it traps to a handler that 
sends the appropriate instruction and 
data to an optional coprocessor, the 
Am29027, and moves the results back to 
the Am29000. Without the coprocessor, 
the Am29000 calls software routines to 
perform the instruction evaluation. This 
latter method is much slower than using 
a coprocessor. But by using the trap 
method, code developed for the Am29000, which may or 
may not have an Am29027 coprocessor available, will run 
directly on future Am29000 processors (and at much greater 
speed). 

A library of software routines has been developed that 
makes extensive use of the floating-point instructions to per¬ 
form functions. We reserved 13 user-mode-accessible global 
registers for math library use to achieve good performance 
for these functions. We can reduce the number of global 
registers by substituting memory locations, but doing so re¬ 
duces performance. Also, all library users must agree on reg¬ 
ister assignment if the library is to be shared. 

System costs 

The on-chip MMU functions enable the Am29000 to oper¬ 
ate without caches and maintain good speed. Table 3 draws 
on information in the “Am29000 Performance Analysis” docu¬ 
ment. 6 The table compares AMD’s 29000 processors running 
at 23 MHz against Sun processors and the VAX 11/780. 

The benchmarks for the Am29000 in Table 3 are for systems 
with separate instruction and data memory systems. The 
Am29000 has separate instruction and data memory buses 
and a shared address bus. The need to implement a memory 
system for instructions and a separate system for data can be 
avoided by connecting the instruction and data buses. 
Doing so makes the Am29000 processor bus system appear 
more like a conventional processor bus system. However, 
sharing a single memory system reduces performance by, 
typically, 15 percent. The use of video memory (video DRAM) 
also avoids the need for two memory systems, as the video- 
out port can be connected to the instruction memory bus, 
and the random access port to the data bus. 

Table 4 gives the approximate costs of a CPU and memory 
devices. It also shows the cost for a memory system based on 
4 Mbytes of DRAM with an additional 0.5-Mbyte SRAM used 
to implement a cache. Studies show that software-controlled 
caches can efficiently use high-speed memory with a mini¬ 
mum of hardware complexity and cost. 7 The lightweight in¬ 
terrupt-handling capability achieved by the Am29000 processor 
operating in freeze mode is well suited to implementing a 
soft cache control mechanism. 


Table 3. CPU speeds. 

                   Am29000           Am29030    Other processors 
Benchmark          SRAM     DRAM     DRAM       Sun 4    Sun 3    VAX 
diff               20.35    12.02    20.86       8.50    3.56     1.0 
grep               14.81     9.42    18.54       7.00    3.00     1.0 
nroff              14.92    10.78    18.31       7.50    2.30     1.0 
Dhrystone 2.0      23.00    15.30    19.69      11.71    2.65     1.0 

Table 4. CPU and memory costs. 

                      Am29000                                                          Am29030 
Feature               SRAM        DRAM        Soft cache        Video DRAM   DRAM        DRAM 
Instruction memory    4 Mbytes    4 Mbytes    4-Mbyte DRAM,     4 Mbytes     4 Mbytes    4 Mbytes 
                                              0.5-Mbyte SRAM 
Data memory           4 Mbytes    4 Mbytes    Shared            Shared       Shared      Shared 
Cost                  $2,792      $424        $352              $448         $264        $262 



The Am29030, unlike the Am29000, shares a bus for in¬ 
struction and data memory and thus does not require sepa¬ 
rate memory systems. It achieves higher performance via its 
on-chip, 8-Kbyte instruction memory cache. The Am29050, 
which is currently the only 29K family member that directly 
executes floating-point instructions, is pin compatible with 
the Am29000 processor and thus supports the three-bus 
architecture. 

Because our processor uses only one address bus, the num¬ 
ber of pins and the system costs are reduced. The one- 
address bus technique is successful because of burst-mode 
addressing and the large register cache. The register file greatly 
reduces the need to access data in external memory. Burst¬ 
mode addressing enables instructions to be fetched in se¬ 
quence without sending a new address for each access. The 
address bus is only required to establish the address of the 
first instruction in a nonsequential instruction fetch. Video 
DRAM or conventional RAM interleaved with a small amount 
of support circuitry, such as an address counter, can be used 
to implement burst-mode memory addressing. 

Contention for the address bus for both instruction and 
data access is rare. By the very nature of the instruction stream, 
a jump instruction cannot occur on the same cycle as a load 
or a store operation, thus the address bus is inherently used 
for instruction and data addressing on different cycles. 


The Am29000 processor’s reduced instruction 
set results in concise, efficient executable code. Each processor 
in this family sustains a performance level of 17 VAX 
MIPS or higher. In addition, the processors have several design 
features that make them particularly suitable for efficient 
Unix implementation, both in terms of operating speed and 
support peripherals required. 

Among these is a scalable methodology that frees the sys¬ 
tem implementer from a fixed-price, fixed-architecture ap¬ 
proach. Because the architectural features required to support 
Unix are packaged along with the processor, the Am29000 
processor is particularly suited to use in low-cost systems. 

Hardware Requirements for 
Neural Network Pattern Classifiers 

A Case Study and Implementation 


A special-purpose chip, optimized for computational needs of neural networks, performs 
over 2,000 multiplications and additions simultaneously. Its data path is suitable particularly 
for the convolutional architectures typical in pattern classification networks but can also be 
configured for fully connected or feedback topologies. A development system permits rapid 
prototyping of new applications and analysis of the impact of the specialized hardware on 
system performance. We demonstrate the power and flexibility of the processor with a neural 
network for handwritten character recognition containing over 133,000 connections. 

Neural networks are rapidly gaining acceptance 
as powerful and versatile 
tools for pattern classification. 1,2 However, 
the widespread use of neural 
network classifiers remains contingent on the 
availability of powerful hardware to provide ad¬ 
equate speed. Such hardware is particularly im¬ 
portant as the computational requirements of 
neural network algorithms are quite different from 
the high-precision processing for which general- 
purpose computers are optimized. Typical neu¬ 
ral network problems involve huge amounts of 
low-resolution and possibly redundant data and 
require a correspondingly high number of low- 
precision arithmetic operations to be performed. 

Special-purpose VLSI (very large-scale integration) 
processors let us overcome neural network 
implementation problems. With their regular structure 
and the small number of well-defined arithmetic 
operations, these networks are well matched 
to integrated circuit technology. The high density 
of modern technologies lets us implement a 
large number of identical, concurrently operating 
processors on one chip, thus exploiting the 
inherent parallelism of neural networks. The regu¬ 
larity of neural networks and the small number 
of well-defined arithmetic operations used by 
neural algorithms greatly simplify the design and 
layout of VLSI circuits. 

But processing speed is not the only constraint 
on neural network hardware design; neural net¬ 
work classifiers benefit from a highly structured 
topology with local receptive fields. Of particu¬ 
lar importance are convolutional architectures in 
which neurons with identical weights process 
different parts of the input or internal state. This 
topology builds into the network knowledge 
about locality of data, improving recognition per¬ 
formance. At the same time it lets us multiplex 
neurons with identical sizes and realize the large 
networks required for difficult classification tasks 
within the density limitations of current VLSI technology. 
The high speed of VLSI technology, 
five orders of magnitude greater than that of 
natural neurons, compensates for the loss of computing 
speed resulting from partial serial processing in such an 
implementation. 

Matching the arithmetic precision of the hardware to the 
requirements of neural networks is crucial for an efficient 
hardware implementation. Neural networks are often quoted 
for their low-resolution requirement, which lends itself well 
to analog implementation. Experiments, however, show that 
the precision requirements of neurons within a single net¬ 
work vary. Specifically, research indicates that higher accu¬ 
racy is often needed in the output layer, for example, for 
selective rejection of ambiguous or otherwise unclassifiable 
patterns. This situation can be handled with a hybrid archi¬ 
tecture, which evaluates the bulk of the network with low- 
resolution analog hardware but implements selected 
connections on a digital processor with higher accuracy. 

Based on these considerations, we designed and fabricated 
ANNA (Artificial Neural Network ALU), a special-purpose chip 
for neural network pattern classification. 3 The chip features 
4,096 individual synapses that can be multiplexed to imple¬ 
ment networks with several hundred thousand connections. 
We can program the number of synapses per neuron to val¬ 
ues between 16 and 256; the number of neurons varies ac¬ 
cordingly between 256 and 16. Implementable network 
topologies include fully connected architectures with or with¬ 
out feedback, local receptive fields, and TDNNs (time-delay 
neural networks) or higher order convolutional connections. 
Depending on the network topology, the chip can sustain a 
performance of up to 5 × 10^9 connections per second (C/s). 
Arithmetic operations take place with 6-bit resolution on the 
weights and 3-bit resolution on the states. 
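
The trade-off between synapses per neuron and neuron count follows from the fixed total of 4,096 synapses. The small helper below works through that arithmetic. The function name is hypothetical, and the rule that only settings dividing 4,096 evenly are accepted is an assumption of this sketch; the text does not state which intermediate values the chip allows.

```c
#include <assert.h>

#define TOTAL_SYNAPSES 4096

/* Number of neurons available for a given synapses-per-neuron setting.
   Returns 0 for settings outside the 16..256 range quoted in the text
   or for values that do not divide the synapse array evenly. */
int neurons_for(int synapses_per_neuron)
{
    if (synapses_per_neuron < 16 || synapses_per_neuron > 256)
        return 0;
    if (TOTAL_SYNAPSES % synapses_per_neuron != 0)
        return 0;
    return TOTAL_SYNAPSES / synapses_per_neuron;
}
```

For example, 16 synapses per neuron gives 256 neurons, and 256 synapses per neuron gives 16 neurons, matching the two end points in the text.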

Because of the high speed, parallelism, low resolution, 
and other characteristics of ANNA, which differ considerably 
from those of general-purpose computers, this hardware must 
be readily accessible to the network designer, so new appli¬ 
cations can be prototyped easily. A development system, con¬ 
sisting of a PC or VME board containing an ANNA chip and a 
digital signal processor (DSP), addresses these needs. We 
download the topology and weights of a network into the 
VME board, and the DSP issues control commands to ANNA. 
The DSP also preprocesses and trains the networks and pro¬ 
cesses operations that require higher precision than that of 
ANNA. 

We selected an optical character recognition neural net¬ 
work to test and demonstrate the flexibility and power of the 
neural network chip. 1 This network identifies handwritten 
digits from a 20 x 20-pixel input image and employs neurons 
with local receptive fields as well as a fully connected layer. 
The network with over 133,000 connections fits on one ANNA 
and is evaluated at a rate in excess of 1,000 characters per 
second, which constitutes a speedup of two orders of magnitude 
over a DSP-based implementation. Despite the low resolution 
of the chip, the error rates of the neural network 
processor and DSP implementation are very similar at 5.3 
percent and 4.9 percent, respectively. For comparison, the measured human 
performance on the same database is 2.5 percent errors. 

Neural network hardware 

Artificial neural networks that solve difficult problems in 
areas such as speech recognition and synthesis, or pattern 
classification, consist of thousands of neurons with tens or 
hundreds of inputs each. Every neuron computes a weighted 
sum of its inputs and applies a nonlinear function to its re¬ 
sult. Architectural parameters, such as the number of inputs 
per neuron, and each neuron’s connectivity vary consider¬ 
ably within a network, and from application to application. A 
special-purpose neural network processor must be flexible 
and powerful enough to accommodate a wide range of ap¬ 
plications. At the same time, the requirements must be care¬ 
fully balanced and the special nature of the task exploited to 
bring an efficient implementation within reach of today’s 
technology. 

We can distinguish two phases of operation in many neu¬ 
ral network applications. During the learning phase, the to¬ 
pology and weights of the network are determined from a 
labeled set of examples using a rule such as backpropagation, 4 
or a network-growing algorithm. 5 In the subsequent retrieval 
or classification phase, the network parameters are fixed. 

The network recognizes patterns based on information 
stored in the architecture and weights during training. Since 
the computational and infrastructure requirements (training 
database) during the learning phase are considerably more 
complex than those for classification, efficiency considerations 
call for separate hardware for learning and retrieval. Network 
parameters determined during learning are downloaded into 
processors specialized for the classification task. This approach, 
which we focus on here, contrasts with implementations of 
neural network processors with on-chip learning. 6 " Those 
circuits are not suitable for the pattern recognition problems 
we investigate here, because of limitations of the training 
algorithms implemented on these chips or because of the 
limited size of the network that can be trained. 

The basic operation performed by a neuron during classification 
is a weighted sum, followed by a nonlinear squashing 
function f, typically a hyperbolic tangent or an approximation 
thereof: 

$$y = f\left(\sum_i x_i w_i + b\right).$$ 

We generally refer to the inputs $x_i$ of the neuron as connections 
and the $w_i$ parameters as weights. Each input is 
either tied to the output y of another neuron or to an external 
input. Optionally, a bias b may be added to the weighted 
sum. 
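
The neuron operation above can be written out directly in C. The sketch uses double-precision arithmetic for clarity, and its squashing function is a simple clamp to [-1, 1], chosen here as one possible approximation of the hyperbolic tangent. Both choices are assumptions of the example and do not describe the ANNA circuit.

```c
#include <assert.h>
#include <stddef.h>

/* Piecewise-linear stand-in for tanh: clamps the argument to [-1, 1].
   This particular approximation is an assumption of the sketch. */
static double squash(double a)
{
    if (a > 1.0)  return 1.0;
    if (a < -1.0) return -1.0;
    return a;
}

/* y = f(sum_i x[i] * w[i] + b) for a neuron with n connections. */
double neuron_output(const double *x, const double *w, size_t n, double b)
{
    double sum = b;                /* optional bias starts the sum */
    for (size_t i = 0; i < n; i++)
        sum += x[i] * w[i];        /* one multiply-add per connection */
    return squash(sum);
}
```

The loop makes the cost visible: each connection contributes exactly one multiplication and one addition per classification.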

The total number of connections in neural networks for 
applications such as handwritten character recognition may 
amount to between 10,000 and several hundred thousand. 8 Networks 
that solve more general problems, such as recognition of 
entire words instead of isolated characters, require even larger 
numbers of connections. The speed requirements of typical 
applications call for a few tens to several thousands of classifications 
per second. For each classification, the network must 
evaluate one multiplication and one addition for every connection, 
which translates to a few billion multiply-add operations 
per second. Only parallel implementations, in which 
several connections are evaluated concurrently, achieve such 
computational power. 

The most general network topology permits connections 
between any two neurons. Such a high degree of (possible) 
connectivity, combined with the need for parallel process¬ 
ing, results in enormous hardware requirements, and there¬ 
fore calls for a compromise. Usually, the neurons in a network 
are arranged in layers, each of which receives inputs only 
from neurons in the previous layer. Layers may be fully con¬ 
nected; that is, each neuron may be connected to every neu¬ 
ron in the preceding layer. Often, however, we use local 
connectivity to express knowledge about the problem (geo¬ 
metric relations such as the neighborhood of pixels in an 
image) in the network architecture and thus improve the rec¬ 
ognition performance. 1 

For example, the fact that some pixels in an image are 
adjacent to each other can be built into the network architecture 
by constraining neurons to receive inputs only from neighboring 
pixels. In a fully connected topology, such information 
must be derived from the training set during the learning 
phase, usually meeting with only partial success. 

A neural network processor could be designed to imple¬ 
ment only networks with fully connected topology. Local 
connectivity would then be realized by simply setting the 
weights of unused connections to zero. Since, in typical neu¬ 
ral networks, the ratio of such unused connections to actual 
connections is easily 100, such an implementation is unac¬ 
ceptably inefficient. The added complexity of the hardware 
required to support local connectivity is no match for the 
millions of connections saved. 

Another challenge for a compact hardware implementa¬ 
tion of a classifier is the amount of memory needed for stor¬ 
ing several tens or hundreds of thousands of weights. 
Fortunately, the weights of many neurons in important con¬ 
nection topologies, including time-delay or feature extrac¬ 
tion neural networks, 1,2,9 are identical. In these architectures 
the connection topology corresponds to a one- or higher 
dimensional convolution, followed by the nonlinear squash¬ 
ing function, as is illustrated. We can realize such a structure 
with a single, time-multiplexed neuron with a corresponding 
saving of storage and computing devices. 
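
A one-dimensional version of this weight sharing can be sketched as a single neuron that is evaluated at successive input positions with the same weights. The function names, the use of double-precision arithmetic, and the clamp-style squashing function are assumptions of the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed squashing function: clamp to [-1, 1]. */
static double clamp_unit(double a)
{
    if (a > 1.0)  return 1.0;
    if (a < -1.0) return -1.0;
    return a;
}

/* Slide one neuron with k shared weights across n inputs.
   Writes n - k + 1 outputs into y and returns that count
   (0 if the kernel is empty or longer than the input). */
size_t conv1d_shared(const double *x, size_t n,
                     const double *w, size_t k, double b, double *y)
{
    if (k == 0 || k > n)
        return 0;
    size_t outputs = n - k + 1;
    for (size_t pos = 0; pos < outputs; pos++) {
        double sum = b;
        for (size_t i = 0; i < k; i++)
            sum += x[pos + i] * w[i];   /* same weights at every position */
        y[pos] = clamp_unit(sum);
    }
    return outputs;
}
```

Only k weights are stored no matter how many output positions are computed, which is the storage saving the text describes.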

We can further optimize the hardware complexity by match¬ 
ing the computational accuracy of the processor to the re¬ 
quirements of typical neural networks. Both experience and 
theory 10 indicate that neural network classifiers can be de¬ 
signed to be insensitive to low-resolution arithmetic. Experi¬ 
ments with character recognizers show that the recognition 
performance remains virtually unchanged when the inputs 
and outputs of the neurons are quantized to 3 bits, and the 
weights to approximately 5 bits. Higher resolution is required 
in the last layer for the rejection of ambiguous or unclassifiable 
patterns. Since in typical neural networks the output layer 
contains only a small fraction of the total number of connec¬ 
tions, we reduce system complexity by evaluating those con¬ 
nections on a different processor with higher accuracy. 
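
As a rough illustration of low-resolution arithmetic, the following Python sketch quantizes states to 3 bits and weights to 5 bits before forming a dot product. The uniform signed grid, the clipping range, and the function names are assumptions; the article does not specify the quantization scheme.

```python
def quantize(value, bits, max_abs=1.0):
    """Round value onto a signed uniform grid with 2**bits - 1 levels.

    For 3 bits the grid is {-3, ..., 3} / 3 scaled by max_abs (assumed scheme).
    """
    levels = 2 ** (bits - 1) - 1
    clipped = max(-max_abs, min(max_abs, value))
    return round(clipped / max_abs * levels) / levels * max_abs

def quantized_dot(states, weights, state_bits=3, weight_bits=5):
    """Dot product of quantized states and quantized weights."""
    qs = [quantize(s, state_bits) for s in states]
    qw = [quantize(w, weight_bits) for w in weights]
    return sum(s * w for s, w in zip(qs, qw))
```

Comparing `quantized_dot` with a full-precision dot product on real network data would give a first estimate of how much accuracy a given layer loses.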

ANNA 

Figure 1 shows the building blocks of ANNA, a neural net¬ 
work chip that implements the concepts just outlined. It con¬ 
currently evaluates several dot products of state and weight 
vectors and applies a nonlinear squashing function to the re¬ 
sults. Data enters the chip through a shift register, which reads 
up to four values at a time. A file with 16 vector registers stores 
intermediate results when multilayer networks are evaluated. 

The 64-word-wide (3 bits per word) shifter reads up to four
inputs in each cycle. In this process, the current shifter
contents shift left one to four word positions. The use of a shifter
limits the number of pins required and supports convolutional
network topologies and multiplexing of neurons with identical
weights. The shifter alone handles one-dimensional
convolutions, while an external data formatter, for example, a line
delay register, 11 is needed for two- or
higher-dimensional computations. Data
loaded into the chip can be buffered temporarily
in a file of 16 vector registers to
reduce the required input bandwidth, to
evaluate neurons with more than 64 inputs,
or to store intermediate results.

Eight banks of vector multipliers perform
the actual computation. Each bank
consists of a latch to hold the state vector
plus eight vector ALUs with 64 synapses
each. A multiplexer that can be configured
to combine the contributions from
one to four vector multipliers connects the
outputs from the vector multipliers to the
neuron bodies. When the latches of several
vector multiplier banks hold different
data, the network evaluates neurons with
up to 256 inputs. The number of neurons
depends on the number of inputs: Extremes
of 16 neurons with 256 inputs
each, or 256 neurons with 16 inputs, as
well as many intermediate arrangements
are possible. 12 The topology can be rearranged
on a per-instruction basis to permit
evaluation of several layers of a
network with different architectures on a
single chip without performance penalty.

The neuron bodies first scale the output
from the vector multipliers by a factor
that can be set in the range 1/16 to 1/2 in
eight levels to optimize the useful dynamic
range of the circuit. Then the neuron
bodies evaluate the squashing
function and convert the result to the

Figure 1. Block diagram of the neural network chip, ANNA.


concurrently with an ongoing CALC instruction. In 200 ns the
chip can, for example, load eight states and store them in a
register and two latches, and evaluate the dot product and
nonlinear function of eight vectors with 256 components each.
The weight refresh takes place simultaneously and is transparent to
the user. Table 1 summarizes the features of the chip.


Table 1. System features.

Characteristic                  Value

Synapses                        4,096
Bias units                      256
Synapses per neuron             16 to 256
Weight accuracy                 6 bits
State accuracy                  3 bits
Input rate                      120 Mbps
Output rate                     120 Mbps
On-chip data buffers            4.6 Kbits
Computation rate (sustained)    5 Gcps
Refresh (all weights)           110 µs
Clock rate                      20 MHz


Figure 2. Die photograph. The synapse array can be seen
in the center, the shifter and register file on the left, the
neuron bodies at the top, and the weight refresh DACs on
the right.

The chip contains 180,000 transistors and measures 4.5 x 7
mm^2 (see Figure 2). It was fabricated in single-polysilicon,
double-metal, 0.9-µm CMOS technology with a 5-V power supply.
The current drawn by the chip reaches 250 mA when all
weights are programmed to their maximum value but is less
than 100 mA in typical operation.

Programmability is one of the key features of the neural
network chip. Table 2 lists a selection of network topologies
that can be implemented and the achieved performance in
each case. The chip processes networks with full or sparse
connection patterns of selectable size, as well as networks
with feedback, at a sustained rate of over 10^9 connections per
second.


Table 2. Sample network architectures
and performance.

Network topology                  Average performance (GC/s)

Fully connected (one layer)
  64 inputs, 64 outputs           2.1
  128 inputs, 32 outputs          1.2
  32 inputs, 128 outputs          1.2

Local receptive fields
  64 x 1, 64 features             2.3
  16 x 16, 16 features            4.7
  16 x 8, 32 features             3.6

Multilayer network
  64 inputs, 32 hidden,
  32 hidden, 32 outputs           0.8

Hopfield neural network
  64 neurons                      2.1


Of particular importance for neural network pattern classifiers
are neurons with local receptive fields and weight sharing,
such as TDNNs. 9 The neural network chip supports weight
sharing in several ways. The shifter and register file enable
loading of data and the computation to go on in parallel. Also,
data that has been loaded onto the chip once can be buffered
and reused in a later computation. Finally, rather than requiring
separate hardware for all weights, the weights of neurons with
identical parameters are stored only once.

Development system 

As mentioned before, the characteristics of the ANNA chip— 
high speed, parallel computation, limited instruction set, and 
low resolution—differ considerably from those of general-pur¬ 
pose computers. Efficient algorithms that derive optimal ben¬ 
efit from the special processor can be designed only if the 
processor is available in the early design stages. A develop¬ 
ment system consisting of an ANNA, a workstation, and ap¬ 
propriate software addresses this requirement. Figures 3 and 4 
illustrate the hardware setup, which includes an ANNA and a 
20-Mflops DSP32C with 1-Mbyte fast static RAM. 

A DMA interface that directly maps the SRAM into the ad¬ 
dress space of the PC bus or VMEbus exchanges data with the 
host computer. The DSP, which is also used for pre- and 
postprocessing and for computations that require higher pre¬ 
cision than that of ANNA, generates instructions for ANNA. 
The entire system is controlled by a program running on the 
workstation that calls routines and exchanges data with the 
DSP transparently to the user. The software for the system is
written in the high-level language C++, with the exception of
a few time-critical routines that are handcoded in DSP assembly
language.


Figure 3. Block diagram of the neural network accelerator
board.

Networks are trained on the workstation or the DSP. The
neural network chip can also be included in the training process,
for example, to adjust the network to ANNA’s low-resolution
processing. Training of individual chips is not necessary,
however, because of the good matching between individual
devices. Once trained, the network topology and weight values
are downloaded into the DSP for execution on ANNA.

Character recognition 

Speed, capacity, and programmability are important aspects 
of neural network hardware. Their practical relevance, how¬ 
ever, must be proven on a real-world application, such as the 
implementation of an optical digit recognizer 1 on the neural 
network chip we describe here. This network has been trained 
with the backpropagation algorithm 4 to recognize handwritten 
digits from a 20 x 20-pixel image. The classification error rate
on a test set of 2,000 handwritten digits is 4.9 percent,
compared to a human error rate of 2.5 percent on the same data.

Figure 5 illustrates the architecture of the network; Table 3 
lists statistical information about each layer. The more than 
3,500 neurons with a total of over 133,000 connections are 
arranged in five layers. The first four layers employ a 2D con¬ 
volutional topology with various kernel sizes and subsampling 
factors. Because of weight sharing, the number of weights 
(free parameters) in these layers is much smaller than the num¬ 
ber of connections. The last layer is fully connected. We chose 
this topology to maximize recognition performance and classi¬ 
fication speed of an implementation on a floating-point 
DSP32C digital signal processor. 13 

Special steps are necessary to adapt the network to the low 
resolution of the chip. Simple quantization of all weight values 



Figure 4. Neural network accelerator with ANNA and the 
DSP32C. 


Figure 5. Architecture of the character recognition
network. The diagram marks neurons and the receptive fields
of neurons; the final layer has 10 outputs.


Table 3. Connectivity of character
recognizer neural network.

Layer    Neurons    Connections (C/S)    Weights

5             10                3,000      3,000
4            300                1,200         12
3          1,200               50,000        500
2            784                3,136          4
1          3,136               78,400        100


Figure 6. Sample chip output for optical character recognition. The gray levels encode the neuron state.


results in an unacceptable loss of accuracy. However, experi¬ 
ments reveal that the computational accuracy provided by the 
chip is adequate for all but the 3,000 weights in the last layer 
of the network. This last layer is retrained with quantized data 
obtained from the chip to eliminate performance degradation. 
After retraining, the classification error rate on the test set is 5.3
percent, compared to the original 4.9 percent. This result is 
obtained consistently with different chips for which the last 
layer has not been retrained individually. Figure 6 shows the 
input, output, and internal states of the neural network for a 
sample input that has been processed by the neural network 
chip. 

The first four layers of the network, with 97 percent of the
connections but only 616 weights, fit on a single neural network
chip. The remaining 3,000 connections of the last layer
are evaluated on the DSP32C. The throughput of the chip is
more than 1,000 characters per second or 130 million connections
per second. This figure is considerably lower than the peak
performance of the chip (5G connections per second), a consequence
of the small number of inputs of most neurons in
the network, for which the chip cannot fully exploit its parallelism.
Nevertheless, the chip’s performance compares favorably
to the 20 characters per second that are achieved when the
entire network is evaluated on the DSP32C. The recognition
rate of the chip is far higher than the throughput of the preprocessor,
which relies on conventional hardware. Improvements
of both the recognition rate and accuracy can be expected
when the network architecture is tuned to take full advantage
of the parallelism of the ANNA chip.


Neural networks are attractive for pattern classification
applications but suffer in practice from the limited
speed that can be achieved with implementations based on
classical processors. This problem can be overcome with
highly parallel special-purpose VLSI circuits. While a fully parallel
implementation of sufficiently large networks is currently
not feasible, we can achieve adequately high performance with
an architecture that exploits the limited connectivity and weight
sharing that are typical for pattern classifiers. We demonstrated
this performance with a neural network classifier with over
133,000 connections that has been implemented on a single
neural network chip performing over 1,000 classifications per
second. This result eliminates throughput from the constraints
faced by network designers. The availability of fast special-purpose
hardware for large applications sets the conditions to
explore new neural network algorithms and problems of a
scale that would not be feasible with conventional processors.

We expect further advances when the architecture of the
network is modified to fully take advantage of the chip’s parallelism.
While the size of the current network has been constrained
by the speed of conventional hardware, such issues
vanish because of the high speed of the chip. The price for
this throughput is the specialization of the circuit, specifically
its low resolution, and its focus on neural network algorithms.
Future research will benefit from the speed of the novel hardware
but must also address questions regarding the limitations
of special-purpose hardware. Furthermore, it appears attractive
to implement larger tasks (to include image location, segmentation,
and scaling into the recognition process) with
neural networks to benefit from the powerful hardware.

Experimentation with Hypercube Database Engines


Effective algorithms appropriately selected for each task are essential if we are to take full 
advantage of parallelism in database systems. Some algorithms perform faster than others, 
depending on the uniqueness value of the data and the uniformity of distribution among the 
nodes in the system. A performance evaluation under varying experimental parameters
determines the optimality of a given algorithm for a particular database task.

George Mason University


Interest in multiprocessors for database
processing has grown beyond the research
communities, into the development
sectors. As the volume and
variety of data stored, accessed, and manipulated
grows, computer and communications designers
are focusing on parallelism, in particular using
general-purpose commercial multicomputers, as
a way to meet database processing needs. 1-3

Using one such computer, Intel’s iPSC/2
hypercube, we measured the effect of
packet size, method of clustering messages,
and internode traffic on the total sustained
communication bandwidth. Having measured the
costs associated with internode communication,
we then analyzed duplicate removal algorithms.
Duplicate removal is an integral part of several
frequently used relational database operations,
including PROJECT and UNION. We therefore
believe it is important to develop efficient
duplicate removal algorithms.

We also studied the effects of nonuniformly
distributed attribute values and tuples across processors
on three proposed duplicate removal algorithms.
We chose algorithms to represent the
several available in the literature. 4-7 We then evaluated
the output collection time.

A tutorial on multiprocessor/multicomputing 
databases 6 or current generation multicomputing 
architectures 8 is beyond the scope of this article. 
We intend only to present a brief overview of the 
iPSC/2’s hypercube message-passing system, and 
discuss the results of our experimentation and 
analysis. Although our algorithms were imple¬ 
mented on the iPSC/2, we believe they would 
work as well on other hypercube machines, such 
as Ncube 9 and Floating Point Systems' T series. 10 

iPSC/2 

Intel’s Personal Super Computer, the iPSC/2, is
a multiple-instruction, multiple-data (MIMD)
multicomputer with nodes connected via a
hypercube topology. Although many definitions
for a multicomputer system exist, we define it as
a system composed of multiple nodes, each of
which is a complete computer. That is, each node
has a set of resources (dedicated I/O, memory,
and CPU) and is under the independent control
of its own operating system.

Message-passing application interface. Application pro¬ 
grams access the iPSC/2’s message-passing system through 
system calls. 1113 The NX/2 operating system provides three 
levels of system calls: synchronous, asynchronous, and inter¬ 
rupt-driven. Users can mix and match these three types of 
calls. For example, a message sent with an asynchronous call 
can be received either with a synchronous call or an inter¬ 
rupt-driven call. The size of the message is limited only by 
the amount of physical memory available at the node. 

The processor uses an application-development message¬ 
passing paradigm. Consider, for example, a node program 
that receives data from the host or from another node and 
then locally processes them. If the node cannot locally pro¬ 
cess without the received data, then it must use a synchro¬ 
nous receive call to receive the data. 

On the other hand, if the node can accomplish some of 
the local processing without the received data, then it uses 
an asynchronous receive. After initiating the asynchronous 
receive, the node carries out the local processing that does 
not require the received data. The node program then checks 
if the data have been received. If the data have not been 
received, the program waits. 
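
The two receive patterns can be contrasted in a short Python sketch. It does not use the NX/2 system calls, which are not reproduced here; a thread pool stands in for the message system, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def receive_message():
    """Stand-in for data arriving from the host or another node."""
    time.sleep(0.01)
    return [3, 1, 2]

def node_program_sync():
    """No local work is possible without the data: block on the receive."""
    data = receive_message()
    return sorted(data)

def node_program_async():
    """Start the receive, do independent local work, then wait if needed."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(receive_message)  # initiate the receive
        local_result = sum(range(10))           # work that needs no received data
        data = pending.result()                 # wait only if data are not yet here
    return local_result, sorted(data)
```

In the asynchronous version the local work overlaps the transfer, so the wait at `pending.result()` is shorter or absent.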

The synchronous calls are also called blocking calls. When
a node sends a message using this protocol, it is blocked from
further processing until the operating system copies the send buffer
into its local buffer and is ready to send this message to its
destination. This operation does not imply, however, that the
message sent is received at the destination node. The message
buffering and flow control in NX/2 ensure the synchronous
transmission intended by the application.

The asynchronous calls are nonblocking calls. When a node
sends a message using this protocol, the sending node is free
to process soon after the execution of that call. The receiving
node must poll for any messages received asynchronously.
This protocol lets the application perform other tasks
while the messages are in transit.

Message-passing measurements. We conducted the following
experiments to analyze the communication behavior
of the iPSC/2’s message-passing system. We measured the
communication times between two adjacent nodes and between
two nodes at the extreme diagonals of the hypercube. The
communication times for one-hop (adjacent nodes) and multihop
(nonadjacent nodes) transfers are almost equal, because the
intermediate nodes take only a few microseconds to help establish a
path. Thus, their computation phase is uninterrupted. This
indicates that the iPSC/2 message-passing interconnection,
though organized as a hypercube topology, practically functions
as a fully connected interconnection network.

A fully connected 128-node system has eight simultaneously
available paths, that is, paths without contention. However,
if multiple nodes simultaneously send messages to a common
destination, then contention occurs. To evaluate
the effects of contention we measured the communication
times when one or more source nodes send messages to a
single receiving node. The communication times at the source
nodes increase in proportion to the number of source
nodes.


Figure 1. Communication time as a function of message
size (Kbytes): small (a), large (b).

In our experiments, we used an eight-node system and
measured average communication time of 1,000 messages in
which each node sends to and receives from another node.
Figure 1 shows the communication time for varying message
sizes between neighbor nodes and nodes three hops apart.
Figure 2 illustrates the computed bandwidths of the communication
channel for different operating conditions. Note that
the communication time for longer messages is comparable
to that of smaller messages. Consequently, we prefer
implementations requiring fewer but longer messages.


Figure 2. Communication bandwidth as a function of message
size: small (a), large (b).

We measured the communication time for a zero-byte message
at 361 µs. This is the minimum setup time required to
transmit a message of any size. In addition to the setup time,
messages larger than zero bytes need a propagation time.


Figure 3. Message transmission protocols: one-step (a),
three-step (b). In (a), the source node establishes a path and
transmits data to the destination node. In (b), the source node
(1) establishes a path, the destination node (2) transmits a
status acknowledgement, and the source node (3) transmits the data.

The system uses two protocols for message transmission. 
For short messages (zero to 100 bytes), a one-step protocol, 
shown in Figure 3a, sets up a path and transmits data 
concurrently. 

Figure 1a shows that, for messages larger than 100 bytes,
the communication time increases by almost 280 µs. (Since
all messages include a 4-byte boundary, a 100-byte message
actually takes 104 bytes.) The time increases because the
system uses a three-step protocol, shown in Figure 3b, for
larger messages. In the first step, the source node establishes
a path by sending a status message to the destination
node. In the second, if the destination node is ready to receive
a large message, it sends back a status acknowledgment.
In the final step, the source node sends the complete
message to the destination node.

This three-step protocol increases the communication time
for large messages, but two different protocols are required
since the one-step protocol could overflow the memory buffer
if used for large messages. Conversely, the three-step method
would add unnecessary protocol overhead if used for small
messages.
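
The protocol switch suggests a simple cost model, sketched below in Python. The 361-µs setup time, the roughly 280-µs three-step penalty, and the 100-byte threshold come from the measurements reported in this article; the per-byte rate is an assumption derived from the 2.7-Mbyte/s figure reported for 64-Kbyte messages (taking 1 Mbyte as 10^6 bytes), and the function names are hypothetical. The model is only a rough approximation of the measured curves.

```python
SETUP_US = 361.0              # measured zero-byte message time
THREE_STEP_EXTRA_US = 280.0   # approximate extra cost above 100 bytes
ONE_STEP_LIMIT_BYTES = 100    # one-step protocol covers 0 to 100 bytes
ASSUMED_BYTES_PER_US = 2.7    # assumption: 2.7 Mbyte/s asymptotic rate

def protocol_for(size_bytes):
    """Name of the protocol used for a message of the given size."""
    return "one-step" if size_bytes <= ONE_STEP_LIMIT_BYTES else "three-step"

def estimated_time_us(size_bytes):
    """Setup time plus transfer time, plus the three-step penalty if any."""
    t = SETUP_US + size_bytes / ASSUMED_BYTES_PER_US
    if protocol_for(size_bytes) == "three-step":
        t += THREE_STEP_EXTRA_US
    return t

def estimated_bandwidth_mbytes_per_s(size_bytes):
    """Bytes per microsecond, which equals Mbyte/s for 1 Mbyte = 10**6 bytes."""
    return size_bytes / estimated_time_us(size_bytes)
```

Under these assumptions, a message just above the threshold costs almost twice as much as one just below it, which is why message sizes slightly over 100 bytes are worth avoiding.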

Figure 1b shows the communication time chart for messages
between 236 and 4,096 bytes. Note that communication
time for extreme diagonal nodes and adjacent nodes is
almost equal. Figure 2a shows the bandwidth chart for messages
from 0 to 256 bytes. Figure 2b shows the bandwidth
chart for messages in the range from 256 bytes through 64
Kbytes. To achieve a 2.7-Mbyte/s bandwidth, the message
size must be at least 64 Kbytes. For small messages, say 100
bytes, the bandwidth is 0.21 Mbytes/s.

We also measured communication times for message transmissions
with contention in the network. Contention is created
by transmitting messages from one or more source nodes
to a common destination node. To demonstrate, we chose
an eight-node hypercube with an arbitrarily chosen destination
node 5. Nodes 0 through 4, 6, and 7 transmit messages
to node 5 and the transmission times are recorded.


Figure 4. Number of common destinations versus time.


Figure 4 shows the number of common destinations versus
message communication time. Each curve represents a
different message size. The message communication time
increases almost linearly as the number of common destinations
increases. For example, sending a 100-byte message from
a single source node takes 501 µs, while for seven source
nodes, the communication time is 3,006 µs. For a 100-byte
message, the communication time increases approximately
400 µs for each common destination node added.
Similarly, for a 104-byte message, the communication time
increases approximately 800 µs for each common destination
node. (The dramatic increase for a 104-byte message is
due to the three-step protocol.)

We also observed that the communication time increase
caused by contention is independent of the location of the
common destination node. This increase stems from the
communications technology employed in the iPSC/2. These
measurements suggest that, to achieve scalable performance, we
should use algorithms that avoid network contention.

Our experiment led us to the following conclusions:

• Designers should choose an appropriate message size 
based on the requirements of an application. The three- 
step protocol causes extra overhead in the transmission 
of messages. Thus, if the message size is only slightly 
over 100 bytes, it may be advantageous to reduce the 
message size. 

• For short message transmissions, the iPSC/2 offers quite
low bandwidth, on the order of Kbytes per second. Thus, message
clustering may be required to achieve higher performance.
In message clustering, one or more messages
destined for the same target node are collected in a packet
and transmitted as a single message packet. However,
message clustering increases processing overhead. For
some applications, this increased overhead may nullify
the benefits of using the high bandwidth.

• The iPSC/2 effectively supports a fully connected network. 
With the notable exception of minimizing node conten¬ 
tion, there are few additional benefits to partitioning the 
application to suit the adjacent node communications. 

• Simultaneous data transfer to the same node dramati¬ 
cally increases the communication time. Consequently, 
we prefer algorithms that avoid such contention. 
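
Message clustering, as described in the second conclusion above, can be sketched in a few lines of Python. The data layout (pairs of destination node and payload) and the function name are assumptions for illustration only.

```python
from collections import defaultdict

def cluster_messages(messages):
    """Group (destination_node, payload) pairs into one packet per destination."""
    packets = defaultdict(list)
    for destination, payload in messages:
        packets[destination].append(payload)
    return dict(packets)

# Hypothetical outgoing traffic from one node: five small messages.
outgoing = [(5, b"a"), (2, b"b"), (5, b"c"), (2, b"d"), (7, b"e")]
packets = cluster_messages(outgoing)
```

Here five short transmissions collapse into three longer ones, at the price of the grouping work done before sending.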

We designed the algorithms presented in the following sec¬ 
tions with these communication constraints in mind. 

Relational database 

Numerous commercial relational database systems are avail¬ 
able that execute on a range of host processing systems in¬ 
cluding personal, mini- and mainframe computers, as well as 
a host of multicomputer environments. Each system’s perfor¬ 
mance depends on the underlying execution environment 
and the liberties taken by the software developer. For ex¬ 
ample, duplicate tuples (or rows) are not allowed in the ac¬ 
tual relational database model, 14 and yet several vendors 
support duplicate tuples. We initially describe a subset of 
relational operators and then provide a discussion of three 
duplicate removal algorithms that can be employed as part 
of the implementation of these operators. 

Operators. The relational database model is the underlying
environment for a wide diversity of applications. For example,
medical pictorial databases, commonly called picture
archiving and communication systems (PACS), employ the 
relational model to store and access patient data. 15 Some de¬ 
signers propose protocol verification systems that use the rela¬ 
tional model to implement an efficient algorithm based on a 
reachability analysis technique. 16 Thus, the development of 
high-performance parallel algorithms for the relational operators
impacts not only the traditional consumer database processing
arena but also nontraditional database domains,
including the communications and medical communities.


Table 1. Relations r and s.

Relation r                   Relation s

Employee number    Age       Employee number    Children    Gender

0                  33        0                  3           M
1                  25        1                  0           M
2                  53        5                  3           F
3                  25        7                  2           F
4                  25        8                  1           F
5                  45        10                 4           M
6                  28        16                 8           F
7                  45        17                 4           F
8                  64        18                 2           M
                             22                 1           F
                             28                 0           M
                             31                 0           F
                             42                 3           F
                             43                 2           M


To review relational database nomenclature, consider the
relational database shown in Table 1. This database consists
of two relations, r and s. Relation r has two attributes: employee
number and age. Relation s has three attributes: employee
number, number of children, and gender.

Each attribute is defined over some finite or countably infinite
domain. In this example, the attributes in r are defined
over the set of whole numbers and the set of whole numbers
between 0 and 120. The attributes in s are defined over the
domains of the set of whole numbers, the set of whole numbers
between 0 and 30, and the set {M, F}. Relation r consists
of nine tuples; each tuple contains two attributes:

{<0, 33>, <1, 25>, <2, 53>, <3, 25>, <4, 25>, <5, 45>,
<6, 28>, <7, 45>, <8, 64>}.

Similarly, s consists of 14 tuples, each composed of three
attributes.

Several popular operators exist for the manipulation of the 
relations comprising the database. Here, we briefly overview 
only four: Project, Union, Select, and Join. Formally speci¬ 
fied, these operators are defined as follows. 

• Project. The projection on $P[XYZ]$, denoted as $\pi_A(P)$, is
defined by $\pi_A(P) = \{\, x[A] \mid x \in P \,\}$, where $A$ is a set of
attributes of $P$.

• Union. The union of two relations, $P[XYZ]$ and $Q[XYZ]$,
denoted as $P[XYZ] \cup Q[XYZ]$, is defined by
$P[XYZ] \cup Q[XYZ] = \{\, x \mid x \in P \text{ or } x \in Q \,\}$, where
$X$, $Y$, and $Z$ are disjoint sets of attributes.

• Select. The selection on $P[XYZ]$, denoted as $\sigma_{B \theta b}(P)$,
where $\theta \in \{<, \le, =, \ge, >, \ne\}$, is defined by
$\sigma_{B \theta b}(P) = \{\, x \mid x[B] \;\theta\; b,\ x \in P \,\}$, where $B$ is an
attribute of $P$.

• Join. The join of two relations $P[XYZ]$ and $Q[VWX]$,
denoted as $P[XYZ] \bowtie Q[VWX]$, is defined by
$P[XYZ] \bowtie Q[VWX] = \{\, u \mid p \in P,\ q \in Q,\ p[X] = q[X],\ u[XYZ] = p,\ u[VWX] = q \,\}$,
where $V$, $W$, $X$, $Y$, and $Z$ are disjoint
sets of attributes. If no common joining attributes exist,
the join of P and Q is the Cartesian product of P and Q.
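
The four operators can be sketched in Python over the relations of Table 1. This is a minimal illustration, not the parallel implementation studied in this article: relations are held as lists of dictionaries, the attribute names (`emp`, `age`, `children`, `gender`) and helper names are assumptions, and duplicate removal is done by a simple sequential scan.

```python
# Relations r and s from Table 1 (attribute names are illustrative).
R = [{"emp": e, "age": a} for e, a in
     [(0, 33), (1, 25), (2, 53), (3, 25), (4, 25),
      (5, 45), (6, 28), (7, 45), (8, 64)]]
S = [{"emp": e, "children": c, "gender": g} for e, c, g in
     [(0, 3, "M"), (1, 0, "M"), (5, 3, "F"), (7, 2, "F"), (8, 1, "F"),
      (10, 4, "M"), (16, 8, "F"), (17, 4, "F"), (18, 2, "M"), (22, 1, "F"),
      (28, 0, "M"), (31, 0, "F"), (42, 3, "F"), (43, 2, "M")]]

def dedup(rows):
    """Sequential duplicate removal that keeps first occurrences."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def project(rows, attrs):
    """Keep only attrs, then remove the duplicates this may create."""
    return dedup([{a: row[a] for a in attrs} for row in rows])

def union(p, q):
    """Combine both relations, then remove duplicates."""
    return dedup(p + q)

def select(rows, predicate):
    """Keep the tuples that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def join(p, q):
    """Natural join on shared attributes; Cartesian product if none."""
    common = set(p[0]) & set(q[0]) if p and q else set()
    return [{**x, **y} for x in p for y in q
            if all(x[a] == y[a] for a in common)]

# The three queries of Figure 5.
q1 = select(S, lambda t: t["children"] == 3)
q2 = union(project(select(S, lambda t: t["children"] == 0), ["emp"]),
           project(select(R, lambda t: t["age"] == 25), ["emp"]))
q3 = join(select(R, lambda t: t["age"] == 45),
          select(S, lambda t: t["gender"] == "F"))
```

Note that `project` and `union` both end in a duplicate removal step; employee 1 appears in both operands of the second query and survives only once.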

Given relations r and s as shown in Table 1, the user requests,
or queries, are evaluated in Figure 5.

In typical database applications, most attributes contain du¬ 
plicate values. Duplicates occur frequently in database pro¬ 
cessing due to the inherent nature of the data being processed. 
For example, George Mason University has about 20,000 stu¬ 
dents in 43 undergraduate programs. Therefore, the program 
discipline attribute in the student database contains a high 
degree of duplicates, that is, a low uniqueness factor. 


1. Find all employees with three children.

$\sigma_{\mathrm{Children}=3}(s)$ =

Employee number    Children    Gender

0                  3           M
5                  3           F
42                 3           F


2. Find all employees that are childless or are 25 years old.

$\pi_{\mathrm{Emp\ No.}}(\sigma_{\mathrm{Children}=0}(s)) \cup \pi_{\mathrm{Emp\ No.}}(\sigma_{\mathrm{Age}=25}(r))$ =

Employee number

1
3
4
28
31

3. Determine all the available information about 45-year-old
women.

$\sigma_{\mathrm{Age}=45}(r) \bowtie \sigma_{\mathrm{Gender}=F}(s)$ =

Employee number    Age    Children    Gender

5                  45     3           F
7                  45     2           F


Figure 5. Queries.


If an attribute value in the local database is also present in the received packet, then
{
    if (local_count > packet_count) then
        local_global_count = local_global_count + packet_count
    else if (local_count < packet_count) then
        delete the tuple from the local database
    else if (local_count = packet_count) then
    {
        if (local_node_address > packet_node_address) then
            local_global_count = local_global_count + packet_count
        else
            delete the tuple from the local database
    }
}

Figure 6. Multiple-message ring algorithm (local processing).
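
The update rule of Figure 6 can be written as a short Python function. This is a sketch under assumptions: the local database is a dictionary from attribute value to its local and global repetition counts, the packet is a dictionary from attribute value to the sender's local count, and all names are hypothetical.

```python
def process_packet(local_db, packet, local_node, packet_node):
    """Apply the Figure 6 rule to every value shared with the packet.

    local_db: {value: {"local": count, "global": count}}
    packet:   {value: local count at the node that formed the packet}
    """
    for value, packet_count in packet.items():
        if value not in local_db:
            continue
        entry = local_db[value]
        # Keep the tuple if this node holds more copies, or on a tie
        # if this node has the higher address.
        keep = (entry["local"] > packet_count or
                (entry["local"] == packet_count and local_node > packet_node))
        if keep:
            entry["global"] += packet_count
        else:
            del local_db[value]
    return local_db
```

The tie-break on node address ensures that when two nodes hold the same number of copies, exactly one of them keeps the tuple.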

In the database presented in Table 1, duplicate entries appear
in the number of children and employee gender attributes
of s, and in the age of the employees attribute in r. In fact, in
most cases, only the key attributes are unique. We focused
on designing algorithms that remove duplicates.

We also studied the effects of low uniqueness on the performance
of proposed parallel algorithms. Other authors have
written on the computational complexity and processing requirements
of relational database queries in the presence of
duplicate attribute values. 7,17,18

Of the relational database operators presented earlier, we 
emphasize Project and Union since both can result in duplicate 
tuple values in the initial stage of processing. That is, the Project 
operator initially eliminates columns which may result in dupli¬ 
cate tuple values. Thus the removal of duplicates from the gen¬ 
erated set of tuples must follow. 

The Union operator eliminates the duplicate copies that re¬ 
sult from the initial processing stage. In the Union operator, the 
initial processing stage involves the combining of the tuples of 
the first relation with the tuples of the second relation. Dupli¬ 
cate removal comprises the second stage. (In a distributed- 
memory architecture, the first stage can occur without any 
internode communication, whereas, in the second stage, all 
nodes must communicate.) The duplicate tuples belong to the 
intersection of the input relations to the Union operation. 

Algorithms for duplicate removal. We describe three
parallel duplicate removal algorithms: the multiple-message
ring (or, simply, the ring), recursive reduction, and bucket
algorithms. Each algorithm removes the duplicates in two steps:
locally at each node, and then globally throughout the
hypercube.

All three algorithms commence with the local duplicate re¬ 
moval phase. This phase forms a local database of tuples with 
unique attribute values originally available at the given node 
(that is, no two tuples have the same attribute value). For each 


unique datum value, two fields 
are maintained. One is a local 
repetition count representing the 
number of copies of the datum 
initially present at the node. The 
other is a global repetition count 
that is initialized to the corre¬ 
sponding local repetition count 
before the global duplicate re¬ 
moval stage. 

Since the local duplicate re¬ 
moval phase is identical in all 
three algorithms, we will not 
elaborate on it. We will briefly 
describe the global duplicate re¬ 
moval phase for each algorithm, 
though further details of the al¬ 
gorithms, including analytical 
models and performance evaluation, are beyond the scope of 
this article . 7 

Multiple-message ring. After the local duplicate removal
phase, a unidirectional ring is embedded using Reflexive Gray
Codes in the (log2 N)-dimensional hypercube. A packet consisting
of tuples with locally unique attribute values and their local
repetition counts is formed at each node. The global duplicate
removal is achieved in N-1 iterations. During each iteration,
each node sends a packet to the next node, in a pipeline ordering
of the Reflexive Gray Code, and receives one from the
previous node. The received packet updates the local database as
shown in Figure 6. This procedure repeats for N-1 iterations,
after which the following are true:

• Attribute values are unique across node boundaries. 

• The node on which an attribute value is found is the node 
which had the largest local repetition count to start with. If 
more than one node had the largest local repetition count 
to start with, then the node with the largest node address 
will retain that unique value. 

• The final local_global_count of an item is the global rep¬ 
etition count of that value. 
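
The ring procedure and the Figure 6 rule can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions, not the authors' iPSC/2 code: the function names are hypothetical, the ring order is taken to be the standard reflected Gray code i XOR (i >> 1), and each node is assumed to forward the packet it received unchanged, so that after N-1 iterations every node has seen every other node's original packet.

```python
from collections import Counter


def gray_ring(k):
    """Node addresses of a k-dimensional hypercube in reflected Gray code order."""
    return [i ^ (i >> 1) for i in range(1 << k)]


def local_dedup(values):
    """Local phase: map each value to [local_count, global_count]."""
    return {v: [c, c] for v, c in Counter(values).items()}


def ring_dedup(data_per_node):
    """data_per_node maps node address -> list of values; needs 2**k nodes."""
    n = len(data_per_node)
    k = n.bit_length() - 1
    ring = gray_ring(k)
    db = {node: local_dedup(data_per_node[node]) for node in ring}
    # Each packet carries its origin address and the origin's local counts.
    packets = {node: (node, {v: c[0] for v, c in db[node].items()})
               for node in ring}
    for _ in range(n - 1):
        # Every node passes its current packet to its successor on the ring.
        packets = {ring[(i + 1) % n]: packets[ring[i]] for i in range(n)}
        for node in ring:
            origin, counts = packets[node]
            local = db[node]
            for value, packet_count in counts.items():
                if value not in local:
                    continue
                local_count = local[value][0]
                keep = local_count > packet_count or (
                    local_count == packet_count and node > origin)
                if keep:
                    local[value][1] += packet_count  # accumulate global count
                else:
                    del local[value]                 # another node keeps it
    return {node: {v: c[1] for v, c in local.items()}
            for node, local in db.items()}
```

With four nodes holding [1, 1, 2], [1, 3], [2, 2, 4], and [1, 1, 4], the value 1 ends up only on node 3 (the tie between nodes 0 and 3 goes to the larger address) with a global count of 5.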

Recursive reduction. The global duplicate removal phase is
achieved by a (log2 N)-fold recursive reduction of the size of the
cube containing the data, with the final results placed at a
single target node. In each step j, the size of the active cube is
halved based on the value of the jth bit in the node address.
That is, all nodes with the jth bit equal to one belong to the first
half while the remaining nodes, those nodes with the jth bit
equal to zero, belong to the second half.

For each node in one half, a corresponding node exists in
the other half, where every bit except j is identical in value.
Nodes that differ from the target node in the jth bit transmit
local data to their corresponding node. The receiving nodes
If a datum in the packet is also present in the local database then
    local_global_count = local_global_count + packet_count
else
    add the datum value to the local database at the appropriate location.


Figure 7. Recursive reduction algorithm (local processing). 


Let N = Number of nodes, k = log2 N, Node address = «n_(k-1) n_(k-2) ... n_0»
Let max_value and min_value define the range of attribute values to be sorted.
Let recursion_no = 0;
while (recursion_no < k)
{
    create lower bucket consisting of attribute values such that
        attribute value < (Maximum attribute value - Minimum attribute value) / 2
    create upper bucket consisting of remaining attribute values.
    Let X = «00 ... 010 ... 0» where the 1 is in the (recursion_no)th position.
    Let partner_node = «n_(k-1) n_(k-2) ... n_0» BIT_WISE_EX_OR X
    if (n_recursion_no = 0)
    {
        send upper bucket to the partner_node
        receive bucket from partner_node and merge it with the lower bucket
        max_value = min_value + (max_value - min_value) / 2
    }
    else
    {
        send lower bucket to the partner_node
        receive bucket from partner_node and merge it with the upper bucket
        min_value = min_value + (max_value - min_value) / 2
    }
    recursion_no = recursion_no + 1
}


whose jth address bit equals zero keep
all tuples whose attribute values map
to the lower attribute value range and
transmit the remaining tuples to their
corresponding node. Nodes whose jth
address bit equals one keep all tuples
whose attribute values map to the upper
attribute value range and transmit
the remaining tuples to their
corresponding node. Figure 8 outlines the
pseudocode description of the bucket
algorithm.
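
The Figure 8 pseudocode can be simulated on one machine as follows. This is an illustrative sketch only: the function name is hypothetical, message passing is replaced by dictionary updates, and the split point is assumed to be min_value + (max_value - min_value) / 2, which matches the range updates in Figure 8.

```python
from collections import Counter


def bucket_dedup(data_per_node, min_value, max_value):
    """data_per_node maps node address -> list of values; needs 2**k nodes."""
    n = len(data_per_node)
    k = n.bit_length() - 1
    # Counter performs the local duplicate removal: value -> repetition count.
    db = {node: Counter(values) for node, values in data_per_node.items()}
    lo = {node: min_value for node in db}
    hi = {node: max_value for node in db}
    for step in range(k):
        bit = 1 << step
        outgoing = {}
        for node in db:
            mid = lo[node] + (hi[node] - lo[node]) / 2
            lower = Counter({v: c for v, c in db[node].items() if v < mid})
            upper = Counter({v: c for v, c in db[node].items() if v >= mid})
            partner = node ^ bit  # differs from node only in this step's bit
            if (node & bit) == 0:
                db[node], outgoing[partner] = lower, upper
                hi[node] = mid
            else:
                db[node], outgoing[partner] = upper, lower
                lo[node] = mid
        # Deliver all buckets only after every node has split its data.
        for node, bucket in outgoing.items():
            db[node].update(bucket)  # merge, summing repetition counts
    return {node: dict(counts) for node, counts in db.items()}
```

After log2 N steps each node holds only the values of its own sub-range, so the results stay distributed across the nodes, as the text describes.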

Communication requirements.

We designed these three algorithms
to exploit our knowledge of the iPSC/2's
message-passing system and to
prevent simultaneous transmission to
a node. In general, the internode
communication needs of the algorithms are
restricted to adjacent nodes.

Unfortunately, the volume of data
transferred between nodes is data
dependent. Hence, it is difficult to
maximize the sustained communication
bandwidth between nodes by using
large packets. When multiple packets
are required to transfer the data
between the nodes, however, we prefer
fewer large packets to many small
packets.

Performance evaluation 

The time required for duplicate re¬ 
moval indicates the efficiency of the 
algorithm. The total time consists of 
three components: 


Figure 8. Pseudocode of bucket algorithm. 

merge their respective local databases with the packets re¬ 
ceived, as shown in Figure 7. On termination, all the unique 
values and their global counts reside in the target node. 
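
A compact simulation of the recursive reduction, combining the halving step with the Figure 7 merge rule, might look like this. It is a sketch under assumptions: the function name and the default target node 0 are hypothetical, and sending a packet is modeled as merging one Counter into another.

```python
from collections import Counter


def recursive_reduction(data_per_node, target=0):
    """data_per_node maps node address -> list of values; needs 2**k nodes."""
    n = len(data_per_node)
    k = n.bit_length() - 1
    # Local duplicate removal: value -> repetition count at this node.
    db = {node: Counter(values) for node, values in data_per_node.items()}
    active = set(db)
    for j in range(k):
        bit = 1 << j
        # Nodes that differ from the target in bit j send and then drop out.
        senders = [node for node in active if (node ^ target) & bit]
        for node in senders:
            partner = node ^ bit
            db[partner].update(db[node])  # Figure 7: add counts or insert
            db[node].clear()
            active.remove(node)
    return dict(db[target])
```

Each pass halves the set of active nodes, so after log2 N passes only the target node is left holding every unique value with its global count.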

Bucket. This algorithm is a special case of the hash join
algorithm 19 and can be used whenever the range of attribute
values in the system is known a priori or can be readily and
efficiently computed. A bucket consists of attribute values
within a specified range. Each node is mapped to a bucket,
and the attribute values residing on all other nodes belonging
to that bucket are routed to that node. The routing is done in
log2 N steps by following an algorithm similar to the recursive
reduction algorithm. During each step j, corresponding nodes
partition the attribute values range into two halves. Nodes


• local duplicate removal,

• global duplicate removal, and 

• data collection. 

The ring and bucket algorithms distribute the resulting data 
across all the nodes. If the results must reside at a single 
node, they must be transmitted. The time needed to accom¬ 
plish this is called collection. 

Whether to include the collection time and the local duplicate
removal time, and which algorithm to select, depends on the
application. For example, if the starting data are already
locally unique, the local duplicate processing time need not
be included. Similarly, if the final results are to be used as an
input for further processing, the collection time should not
be included in the execution time. Unless otherwise stated,
we assume that the local duplicate removal time is included
and the collection time is not included for computing the
total execution time of an algorithm.

We carried out experiments by varying the work load char¬ 
acteristics and the system configuration, namely the number 
of nodes. The cardinality, uniqueness, tuple size, distribution 
of the unique values, and distribution of data among the 
nodes define the work load characteristics. We define the
uniqueness factor (or simply the uniqueness) of the data as
the number of distinct attribute values expressed as a fraction
of the total number of attribute values. In our experiments,
the attribute value of a tuple is taken as an integer. We refer
to this integer value as the data value.

As an example, a typical run may consist of eight nodes, 
8,000 data values, and a uniqueness of 0.4. The database 
randomly generates the required data and then distributes 
the data among the nodes, either uniformly (each node re¬ 
ceives the same number of data values) or nonuniformly. 
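
To make the workload definition concrete, the sketch below generates integer data values with a requested uniqueness and deals them out uniformly. The article does not describe its generator, so everything here is an assumption: the function names are hypothetical, each distinct value is forced to appear at least once, and the remaining slots are filled with uniformly chosen repeats.

```python
import random


def generate_data(num_values, uniqueness, seed=None):
    """Return num_values integers containing round(num_values * uniqueness)
    distinct values (assumed generator, not the authors')."""
    rng = random.Random(seed)
    num_distinct = max(1, round(num_values * uniqueness))
    distinct = list(range(num_distinct))
    # Every distinct value appears once; the rest are random repeats.
    repeats = [rng.choice(distinct) for _ in range(num_values - num_distinct)]
    values = distinct + repeats
    rng.shuffle(values)
    return values


def uniqueness_factor(values):
    """Distinct values as a fraction of all values."""
    return len(set(values)) / len(values)


def distribute_uniformly(values, num_nodes):
    """Give each node the same number of values (any remainder is dropped)."""
    share = len(values) // num_nodes
    return {node: values[node * share:(node + 1) * share]
            for node in range(num_nodes)}
```

For the typical run quoted above, generate_data(8000, 0.4) yields 3,200 distinct values and distribute_uniformly(values, 8) gives each node 1,000 data values.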

We used each of the three algorithms for parallel duplicate 
removal and recorded the execution times. The execution 
times may include the local duplicate removal time and the 
collection time, as described earlier. Since the data are gener¬ 
ated randomly, we expect the execution time of an algorithm 
to change every time the experiment is carried out. To ac¬ 
count for the random behavior, we conducted the experi¬ 
ments multiple times and observed the deviation of the 
execution times of an algorithm. 

We found a small deviation in the execution times for all
the runs except those with nonuniformly distributed data
among the nodes. Therefore, for the data distributed uniformly
among the nodes, we took the average of five runs as
the true execution time of an algorithm. For the nonuniformly
distributed data, we took the average of 25 runs of each
algorithm as the execution time.

Our objective was to evaluate the performance of each algorithm
under different work load and system conditions with
the eventual goal of dynamically selecting the algorithm of
choice for each application. The independent variables were

• the number of nodes, 

• the number of data elements, 

• the size of each element (or tuple), 

• the uniqueness of the data (degree of skew), 

• the initial distribution of the data, and 

• the desired final distribution of the data (distributed or 
centralized). 

Note that the choice of the optimum algorithm must be de¬ 
termined not only by the performance evaluation but also by 
the constraints on the initial data and final results. 

Figure 9a shows the execution times (including the local
duplicate removal times) of the three algorithms and an optimal
algorithm as a function of the uniqueness. The system
consisted of 16 nodes and 16,000 data values. The optimal
algorithm using N nodes requires (1/N)th the time taken by
an optimal uniprocessor algorithm (a tree duplicate removal
algorithm, in our case).

Figure 9. Execution times including the local duplicate removal
times: 16 nodes, 16,000 data values (a); 16 nodes,
8,000 data values (b); 8 nodes, 16,000 data values (c); and
8 nodes, 8,000 data values (d).

Note that for the bucket algorithm the speedup in the
execution time is about N/2 when N nodes are used. Moreover,
the speedup is the same for all values of uniqueness. The
speedups for the ring and recursive reduction algorithms are
small for high uniqueness values. But the speedups increase
as the uniqueness decreases.
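
As a short worked restatement of these figures, assume speedup is measured against the optimal uniprocessor algorithm. The symbols T_1 (uniprocessor time), T_N (time on N nodes), and S (speedup) are introduced here for illustration and are not the article's notation.

```latex
S(N) = \frac{T_1}{T_N}, \qquad
S_{\text{optimal}}(N) = \frac{T_1}{T_1 / N} = N, \qquad
S_{\text{bucket}}(N) \approx \frac{N}{2}
\;\Longrightarrow\; S_{\text{bucket}}(16) \approx 8, \quad
S_{\text{bucket}}(8) \approx 4 .
```

Under this reading, the bucket algorithm takes roughly twice as long as the optimal N-node algorithm at every uniqueness value.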

Figures 9b, 9c, and 9d show similar plots for 16 nodes, 
8,000 data values; 8 nodes, 16,000 data values; and 8 nodes, 
8,000 data values. The bucket algorithm is insensitive to the 
uniqueness of the data, except for very low uniqueness val¬ 
ues. The recursive reduction is the most sensitive, and the 
ring algorithm is moderately sensitive to the uniqueness. 

The bucket algorithm performs fastest for all except the
very low values of uniqueness. For completely unique data 
(a uniqueness factor of 100 percent), the recursive reduction 
algorithm takes almost twice as long as the ring algorithm. 
But the performance of the recursive reduction significantly 
improves as the uniqueness decreases, due to its sensitivity 
to the uniqueness of data. At lower levels of uniqueness, the 
recursive reduction actually performs better than the ring. 
This threshold value of the uniqueness changes with the sys¬ 
tem and data conditions, namely the number of nodes and 
the number of data values. For example, for 16 nodes and 
16,000 data values, the threshold uniqueness is 0.28, whereas 
for 16 nodes and 8,000 data values, it is 0.31. 

Three components of the execution time explain sensitiv¬ 
ity towards uniqueness. 

• Local duplicate removal. Consider a database of size
M with a global uniqueness p. Divide the database into
N parts and let the local uniqueness of each part be q.
We see that, in general, q will be larger than p. In fact,
for p >> 1/N, q ≈ 1. For example, for M = 16,000 and
N = 16, if p is reduced from 1.0 to 0.1, q remains at 1.0.
Consequently, the local duplicate removal time is not
sensitive to the global uniqueness p, especially for large
values of p. 7

• Global duplicate removal. This component consists 
of two parts: 

Data transfer. In the bucket and ring algorithms, the local 
uniqueness factors determine the size of packets to be 
transferred. For both algorithms, the packet size is not 
affected by change in the global uniqueness p. On the 
other hand, in recursive reduction, the size of the cube 
decreases as the recursion progresses and the local 
uniqueness changes from 1.0 (in the first recursion) to p 
(in the last recursion step). As a result, the size of the 
packets to be transferred increases from the first to the 
last recursion and is sensitive to the global uniqueness. 
Local processing. For the bucket algorithm, the size of 


the received packet and the size of the local database 
are unaffected by the change in the global uniqueness. 
As a result, the local processing time of the bucket algo¬ 
rithm is insensitive to global uniqueness. For the ring 
algorithm, the size of the received packets and the initial 
size of the local database do not depend upon the global
uniqueness. As the iterations progress, however, the
local database shrinks. The rate of this decrease depends
on the global uniqueness. The lower the global unique¬ 
ness, the faster the local database shrinks. Hence, the 
local processing time is moderately sensitive to the change 
in the global uniqueness. For the recursive reduction 
algorithm, as explained earlier, both the size of the re¬ 
ceived packets and the size of the local database are 
sensitive to the global uniqueness. Consequently, the 
local processing time of the recursive reduction is highly 
sensitive to the global uniqueness. 

Depending on the particular application at hand, the local
duplicate removal time may or may not be important. If this
time is not taken into account, the execution times of the
three algorithms decrease by the same amount. Since the local
duplicate removal time changes with uniqueness, this decrease,
though exactly the same for all three algorithms, differs
for each uniqueness. Even if we do not
consider the local duplicate removal time, the threshold value
of the uniqueness, below which the recursive reduction
algorithm performs better than the ring algorithm, does not
change. 7

The performance of the three algorithms for very low 
uniqueness values, shown in Figure 9, is of interest. Figure 10 
shows execution times of the three algorithms for uniqueness 
varying from 0.001 to 0.01. Note that the ring algorithm is the 
most time-intensive of the three. The recursive reduction per¬ 
forms better than the bucket for most uniqueness values. 

In the ring and the bucket algorithms, the final results are
distributed among the different nodes. Such a distribution may
be desired in various situations, including intermediate
relations within the execution of a query. In many applications,
however, the final results are needed in one node. In such
cases, for the ring and bucket algorithms the distributed results
must be collected and merged at one node. Note that the
recursive reduction algorithm already includes the collection
time.

Figure 11 shows the execution times including collection.
The ring algorithm performs slowest. The bucket algorithm's
performance degrades but remains better than that of the
recursive reduction algorithm, especially for high uniqueness
factors. For uniqueness values below a threshold, recursive
reduction is better than bucket. The value of this uniqueness
threshold depends on the number of nodes and the data size.

We have assumed that the data with particular uniqueness
are generated randomly and that the distribution of different
unique values in the data is uniform. This assumption may not
always be true. We studied the effect of a skewed distribution
of the unique data values on the algorithms by generating the
data values using a nonuniform distribution. Figure 12 shows
the execution times of the three algorithms when we generated
data using a Gaussian distribution with μ equaling the
mean of the unique values to be generated and σ equaling 50.

Figure 10. Execution times for low uniqueness, including
the local duplicate removal times: 16 nodes, 16,000 data
values (a); 16 nodes, 8,000 data values (b); 8 nodes, 16,000
data values (c); and 8 nodes, 8,000 data values (d).

Figure 11. Execution times including collection: 16 nodes,
16,000 data values (a); 16 nodes, 8,000 data values (b); 8
nodes, 16,000 data values (c); and 8 nodes, 8,000 data values (d).

Figure 12. Execution times for nonuniform tuple counts:
16 nodes, 16,000 data values (a); 16 nodes, 8,000 data values (b);
8 nodes, 16,000 data values (c); and 8 nodes, 8,000
data values (d).

As expected, the execution times of all three algorithms
remain unchanged for high uniqueness. For low uniqueness,
the execution times of the ring and the recursive reduction
algorithms decrease. This is not surprising because, if the
distribution of the unique values is skewed, the local uniqueness
factor on each of the nodes is generally smaller than it would be
if the unique values were not skewed. Smaller local uniqueness
implies that the local as well as the global duplicate
removal times are smaller.

The performance of the ring and recursive reduction algorithms
for the uniformly generated data is the worst-case
performance for a specified uniqueness as far as the skewness
of the unique values is concerned. In the case of the bucket
algorithm, the local duplicate removal time decreases, as it does
for the ring and recursive reduction algorithms.

The global duplicate removal time increases because the 
design of the bucket algorithm assumes a uniform distribution 
of data. If the distribution of the data is skewed, the size of each 
bucket will differ. As a result, after a few iterations some nodes 
will be processing and transferring large amounts of data, 
whereas others may be operating on small amounts of data. 
The discrepancies among the volume of the data on different 
nodes increase with each iteration. Hence, in the bucket algo¬ 
rithm, the load-sharing among the nodes is not uniform if the 
data are not generated uniformly. This increases the global 
duplicate removal time of the bucket algorithm. 

The total execution time will either decrease or increase 
depending on the sum total of the changes in the local and 
global duplicate removal times. Comparing Figures 9 and 12, 
we see that the total execution time remains almost constant 
for the bucket algorithm. 

We assumed the distribution of the initial data among the
nodes was uniform. This, too, may not always be the case. To
study the effect of nonuniformly distributed data on the three
algorithms, we uniformly generated data but distributed it
nonuniformly to the nodes as follows:

• Generate a random number between 0 and N using a uniform
random number generator. (In this procedure, N denotes the
total number of data values.)

• Let the number be x_0.

• Randomly select x_0 data values and assign them to node 0.

• Now generate a random number between 0 and (N - x_0)
using a uniform random number generator.

• Let that number be x_1.

• Randomly select x_1 data values and assign them to node 1.

• Repeat this procedure to assign the data to the remaining
nodes.

• Assign to the last node all data not assigned to any other
node.
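
The steps above can be sketched as follows. The function name and the seed parameter are hypothetical, and shuffling the data once before slicing is used here as a stand-in for "randomly select" at each step.

```python
import random


def distribute_nonuniformly(values, num_nodes, seed=None):
    """Assign values to nodes with random, generally unequal, share sizes."""
    rng = random.Random(seed)
    remaining = list(values)
    rng.shuffle(remaining)  # slicing a shuffled list is a random selection
    assignment = {}
    for node in range(num_nodes - 1):
        # Uniform random count between 0 and the number of values left.
        x = rng.randint(0, len(remaining))
        assignment[node] = remaining[:x]
        remaining = remaining[x:]
    # The last node receives everything not assigned to any other node.
    assignment[num_nodes - 1] = remaining
    return assignment
```

Because each draw is bounded by what is left, early nodes tend to receive far more data than later ones, which helps explain why run-to-run variation is large under this scheme.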

As can be expected, the execution times of the three algorithms
change substantially every time new data are generated using
this procedure. Therefore, we repeated the procedure 25 times
instead of the normal five times. Figure 13 shows the results
on the nonuniformly distributed data. Note that the execution
times of all three algorithms increase by a factor of 2 to 3. However,
the bucket algorithm remains the algorithm of choice.

Another important issue related to the initial data is the tuple 
size. So far we have assumed that the data consist of only the 
key, which is assumed to be an integer value. In reality, a tuple 
consists of a key along with other fields of varying sizes. In 
such cases, the duplicates are removed by appending the other 
fields of a duplicated tuple. For example, assume that a tuple 
consists of a key and a 10-byte field. If the data contain 50 
duplicates of a key value, after the duplicates are removed, the 
key value will occur only once along with 50 fields of 10 bytes 
each. 
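
The effect of non-key fields on duplicate removal can be illustrated with a small sketch. The function name and the (key, field) pair representation are assumptions made for the example; the article does not specify a tuple layout.

```python
from collections import defaultdict


def remove_duplicates_keep_fields(tuples):
    """Collapse (key, field) pairs so each key occurs once, keeping every
    field that belonged to a duplicate of that key."""
    merged = defaultdict(list)
    for key, field in tuples:
        merged[key].append(field)
    return dict(merged)


# Fifty duplicates of one key, each carrying a 10-byte field.
tuples = [(7, b"0123456789") for _ in range(50)]
result = remove_duplicates_keep_fields(tuples)
payload_bytes = sum(len(field) for field in result[7])
```

Here the key 7 survives once with 50 appended fields, 500 bytes in all, so the data volume moved between nodes grows with tuple size even though the number of keys shrinks.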

Figure 14 on page 54 shows the execution times of the three 
algorithms for varying tuple sizes. Note that the execution times 
of all three algorithms increase by an order of magnitude. The 
recursive reduction is the most sensitive to the tuple size and 
clearly performs slowest for large tuple sizes. The sensitivity 
of the recursive reduction algorithm to the tuple size is consistent
with its sensitivity to the uniqueness p and is explained
by examining the three components of the execution
times, as discussed earlier. Table 2 on page 55 summarizes
our results.

THE EFFICIENT EXPLOITATION OF PARALLELISM in a distributed
memory architecture necessitates the development
of algorithms designed for the underlying execution environment.
Proper algorithmic design accounts for both the computational
and communicational demands of the application.

Our study focused on algorithms designed for duplicate 
removal on a hypercube database system. We proposed, 
implemented, and evaluated three algorithms. We then used 
the results obtained by a performance evaluation of these 
algorithms under varying experimental parameters to deter¬ 
mine the optimality of a given algorithm for a particular task. 

Modeling the three algorithms and predicting their perfor¬ 
mance for a given system and load characteristics simplifies 
the determination of the algorithm of choice. 7 Another ad¬ 
vantage of using the analytical models is that the results can 
be generalized to different-size iPSC/2s, as well as to other 
types of hypercubes provided certain system parameters— 
namely, the processing and the communication speeds—are 
known a priori. 






Figure 13. Execution times for data distributed nonuni¬ 
formly among the nodes: 16 nodes, 16,000 data values (a); 
16 nodes, 8,000 data values (b); 8 nodes, 16,000 data val¬ 
ues (c); and 8 nodes, 8,000 data values (d). 


Figure 14. Execution times for different tuple sizes (16 
nodes, 16,000 data values): 4-byte tuples (a); 14-byte 
tuples (b); 24-byte tuples (c); 104-byte tuples (d). 


The distributed database is a special case of a multicomputer
in which the communication bandwidth is lower than in a tightly
coupled multicomputer. Therefore, the analytical models of
the three duplicate removal algorithms can be generalized to
distributed database design as well.

We plan to design a dynamic query optimization scheme 
that incorporates the results of our study in determining how to 
best process a given relational operator within a given query. 
By examining the degree of uniqueness within the input rela¬ 
tions and selecting the corresponding optimal duplicate re¬ 
moval algorithm, we intend to reduce the total query 
processing time. The cost of determining the optimal algorithm 
is an important consideration that needs to be addressed.
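
As a hedged illustration of how such a selection might use the measured thresholds, the sketch below encodes only the first constraint row of Table 2 for N = 16 and M = 16,000 (initial data not unique, final results may be distributed). The function name is hypothetical, and the behavior at exactly the threshold is an assumption, since the table gives only strict inequalities.

```python
def choose_algorithm(p, range_known):
    """Pick a duplicate removal algorithm from global uniqueness p.

    Thresholds are the Table 2 values for N=16, M=16,000 with nonunique
    initial data and distributed final results; other rows would need
    their own thresholds.
    """
    if range_known:
        # The bucket algorithm needs the attribute value range a priori.
        return "bucket" if p > 0.09 else "recursive reduction"
    return "ring" if p > 0.34 else "recursive reduction"
```

A real optimizer would also have to estimate p cheaply, which is the cost the closing sentence above points to.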
Table 2. Summary of the algorithm of choice.

Constraints and data value range    N=16, M=16,000       N=16, M=8,000        N=8, M=16,000        N=8, M=8,000

Initial data not unique and final results may be distributed
  Range of data values known        Bucket for p > 0.09  Bucket for p > 0.04  Bucket for p > 0.02  Bucket for p > 0.02
                                    RR* for p < 0.09     RR for p < 0.04      RR for p < 0.02      RR for p < 0.02
  Range of data values not known    Ring for p > 0.34    Ring for p > 0.32    Ring for p > 0.25    Ring for p > 0.3
                                    RR for p < 0.34      RR for p < 0.32      RR for p < 0.25      RR for p < 0.3

Initial data unique and sorted, and final results may be distributed
  Range of data values known        Bucket for p > 0.1   Bucket for p > 0.04  Bucket for p > 0.02  Bucket for p > 0.02
                                    RR for p < 0.1       RR for p < 0.04      RR for p < 0.02      RR for p < 0.02
  Range of data values not known    Ring for p > 0.4     Ring for p > 0.3     Ring for p > 0.25    Ring for p > 0.3
                                    RR for p < 0.4       RR for p < 0.3       RR for p < 0.25      RR for p < 0.3

Initial data not unique and final results should not be distributed
  Range of data values known        Bucket for p > 0.1   Bucket for p > 0.12  Bucket for p > 0.13  Bucket for p > 0.13
                                    RR for p < 0.1       RR for p < 0.12      RR for p < 0.13      RR for p < 0.13
  Range of data values not known    RR                   RR                   RR                   RR

Initial data not unique and final results may be distributed. Tuple counts not uniform
  Range of data values known        Bucket               Bucket               Bucket               Bucket
  Range of data values not known    Ring for p > 0.15    Ring for p > 0.25    Ring for p > 0.08    Ring for p > 0.15
                                    RR for p < 0.15      RR for p < 0.25      RR for p < 0.08      RR for p < 0.15

Initial data not unique and final results may be distributed. Data not distributed uniformly on the nodes
  Range of data values known        Bucket               Bucket               Bucket               Bucket
  Range of data values not known    RR                   RR                   RR                   RR

Initial data not unique and final results may be distributed. Tuple size=29 bytes
  Range of data values known        Bucket               Bucket               Bucket               Bucket
  Range of data values not known    Ring                 Ring                 Ring                 Ring

*RR = recursive reduction; p = global uniqueness.
Conformance Testing of VMEbus and
Multibus II Products


Designers working with the European Community’s Conformance Testing Services program 
have developed a system for testing VMEbus and Multibus II products. Conformance is neces¬ 
sary to allow and guarantee interoperability between products from different vendors. The 
EC’s test system is mainly automated to reduce costs and ensure impartiality. An overview of 
the system’s components familiarizes readers with its procedures. 
Conformance testing seeks to determine
if a product or parts of a product (in
our case, parallel bus interfaces) have
been built according to the rules of
the applied standard. 1 One way to check conformity
is to examine the schematics, programmable
logic array data, and PROM data, and compare
the results with the rules of the standard. Another
way is to create a testing environment representing
the standard and test the product against
it. Dedicated devices of the test environment
monitor the results of interaction.

This second way requires a large investment in
methodology, hardware, and software before the
first product can be tested. But this approach's
advantage is that it produces test results that are
objective, reproducible, and independent of
the test engineer. A consortium of advisors chose
the second method for the Bus Interface Conformance
Test (BICT) project.


Test system 

The BICT system tests the conformity to standards
of bus interfaces of VMEbus and Multibus
II boards and their backplanes. 2,3 The test system
represents the standards and offers all the options
the standards give. This approach allows a
board to be plugged into the test system, where
predefined automatic or semiautomatic procedures
test the features of the unit under test. 4


The bus standards specify four main categories: 

• timing and logical behavior of signals, 

• message-passing protocol (for Multibus II 
only), 

• electrical properties of boards and back¬ 
planes, and 

• mechanical dimensions of boards and 
backplanes. 

A specific configuration of the test system and its 
devices is required for each category. Figure 1 
on page 58 shows the general configuration of 
the BICT system, including the dial bench gauge, 
which is operated manually. 5,6 

A test system controller manages the activities 
of the test system devices and gives instructions 
to the operator, or test officer. For each test cam¬ 
paign the controller collects the measuring re¬ 
sults, stores them in a database, and generates a 
test report on the basis of collected data. The
controller performs a series of functions (see
box on page 58), classified by type of action. 7

In this article we explain a test campaign as a
walk through the test system, the way a customer
(such as a manufacturer, reseller, or original
equipment manufacturer) would see it. We focus
on testing bus interfaces because it is more
complex than testing backplanes.
Figure 1. BICT system. (Labels recoverable from the diagram: BICT, IEEE-488 bus, and dial bench gauge, operated manually.)

Test system controller functions 

For the classifications listed here, the controller provides the following information or performs the following tasks. In
the case of manual tests, the operator provides the information, which the controller then analyzes.


The test officer 

• Information about the actual test and test state 

• Instructions about mechanical tests to be carried 
through by the officer 

• Information about test results by immediate check 
against parts of the relevant standard 

• Requests test results and offers data entry capabilities

Test system controller 

• Start/stop logic analyzer 

• Start/stop pattern generator 

• Poll logic analyzer and digital storage oscilloscope on 
trigger or time-out if no triggering 

• Select and start test routine of the unit’s upper tester (via 
communication channel) 

• Upload data from test devices 

• Download data to test devices 

Digital storage oscilloscope 

• Measurement of signal edges and waveforms for 
electrical test 

• Measurement of voltage levels for static electrical test 


Logic analyzer 

• Run the trigger to monitor the logical and timing behav¬ 
ior of the unit 

• Store captured data in case of triggering 

• Access to database of trigger setups 

• Trigger on certain messages or bus states 

• Store captured message in case of triggering 

Pattern generator 

• Run sequence of patterns for bus operation initiation/ 
response 

• Control of delay lines for signal adjustment 

Power supply 

• Provide unit with current in the specified voltage range 

• Measurement of current consumed by unit 

Mechanical test devices (dial bench gauge, slide 
calipers) 

• Measurement of boards 

• Measurement of devices mounted on boards 
BICS/BIXIT

Before testing begins, a customer provides the following 
information about the unit on two DOS-based, electronic 
forms: 

• Bus interface conformance statement (BICS). The 

customer enters the capabilities and options of the stan¬ 
dard implemented in the unit. 

• Bus interface extra information for testing (BIXIT). The
customer provides further details on the implementation
features of the unit, such as address ranges, interrupt
levels, bus request levels, and message types.

BICS/BIXIT is a menu-driven, interactive input program that
automatically checks for completeness and consistency as
the user fills in the form. It features dynamic data management
and automatic data interfacing and recording. Figure 2
depicts a BICS/BIXIT input screen requesting information
about the implemented features of a memory slave.
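The kind of completeness and consistency check described above might look like the following minimal sketch. The field names and the address-range rule are hypothetical and chosen only for illustration; the internals of the actual DOS-based program are not described in this article.

```python
# Hypothetical sketch of a form check; field names are illustrative only.
REQUIRED_FIELDS = ("module_type", "data_width", "address_start", "address_end")


def check_form(form):
    """Return a list of problems found in a BICS/BIXIT-like form (a dict)."""
    problems = []
    # Completeness: every required field must be filled in.
    for field in REQUIRED_FIELDS:
        if form.get(field) is None:
            problems.append(f"missing field: {field}")
    # Consistency: the declared address range must not be inverted.
    start, end = form.get("address_start"), form.get("address_end")
    if start is not None and end is not None and start > end:
        problems.append("address_start exceeds address_end")
    return problems
```

An empty list means the form passed both checks; any other result lists what the user still has to correct.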

Chip-level testing is complex, time-consuming, and not cost-effective
for the purpose of conformance testing. Board manufacturers
use relatively few standardized silicon devices to
implement the interface drivers, and sufficient and reliable
information about their capabilities is available. We can therefore
avoid testing on the chip level. 8

In lieu of such testing, the board manufacturer provides 
information about drivers, transceivers, and other devices that 
implement the bus interface, on another DOS-based form, 
the vendor claim. The vendor returns this form, along with 
the BICS and BIXIT data, to the center, where it is analyzed 
by a test program running on the controller. 

In the first step of testing, static conformance review, the
controller checks this information against a silicon database
and the relevant rules of the standard.

Mechanical test

Test officers manually assist in the next step, testing whether
the board fits into a standard rack and its backplane. A mechanical
test program running on the controller guides the tester and
indicates the points to be measured. The mechanical test is based
on two test methods: the outline test with the dial bench gauge
and the rack fit test with the rack simulation.

For the outline test, the dial bench gauge (see Figure 1) can be
configured according to the controller’s instructions and the test
officer mounts the unit board onto the gauge. The tester then
applies the test tool to the measuring points asked for by the
test program. The read values are transmitted via the IEEE 488
bus to the controller. The program determines if the read value
is plausible and compares it with the requirements of the standard.

The dial bench gauge consists of a base plate on which
smaller plates are mounted, carrying a reference bolt for each
position to be measured. The positions of these reference
bolts vary for different board sizes (single, double, and triple
European cards, 160 or 220 mm long).

The tester fixes the unit on two clamping strips. Before
taking the measurements, the tester samples a reference board
to get the offset values. This procedure, called the calibration
phase, delivers the offset factors, which are stored in the
controller and integrated into the measurement calculation.
Figure 3 on page 60 shows the configuration for the mechanical
test.

Testers use a slide caliper for special cases where the depth
of the gauge is not applicable. The rack fit test is a purely
qualitative test, as postulated in the VME standards. The tester
performs this test manually.

Electrical test

The electrical test verifies rules concerning requirements
for electrical characteristics and behavior. The measurements
are electrical, chronological, or a mixture of the two.

One part of the electrical test measures electrical characteristics
independent of special capabilities of the unit, such
as connectivity, termination networks, and power consumption.
The other part times signals or performs specific functions
of the unit, measuring signal waveforms (rise and fall
times, high and low times), signal levels, and power on/off
behavior.

Figure 2. Input screen of BICS/BIXIT.
As with the mechanical test, the controller guides the operator
through the electrical tests (see Figure 4). The test
officers handle the test probes manually, since for each rule
different test points must be checked. The obtained results
are uploaded to the controller via an IEEE 488 interface, where
a program compares them with the specification values and
generates a statement for the report.
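As a rough illustration of that last step, the comparison of an uploaded value with its specification limits could be sketched as below. The function name, the rule label, and the limit values are assumptions made for this example and are not taken from the test specification.

```python
def evaluate_measurement(rule, measured, low, high):
    """Compare one measured value with its specification limits.

    Returns a one-line statement of the kind a report might contain.
    All names and limits are hypothetical.
    """
    verdict = "passed" if low <= measured <= high else "failed"
    return f"{rule}: measured {measured} (limits {low}..{high}) {verdict}"


# Example: a rise time in nanoseconds checked against assumed limits.
print(evaluate_measurement("rise time", 8.0, 2.0, 10.0))
```

Keeping the check as a pure function of the measured value and its limits means the same routine can serve every rule; only the limits change from rule to rule.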

The electrical test does not measure driving and receiving
characteristics of the signal lines. Testing them would mean
testing silicon and, as mentioned earlier, this exceeds the
scope of conformance testing on the board level. Therefore,
the test officers refer to the information provided on the vendor
claim form to check these devices or assemblies. Operators
apply this method when items are not testable or may
change from board to board. Items tested in this way include
the drivers, receivers, and connectors used.

Timing protocol test 

The timing protocol test determines whether the interface
modules of the unit communicate through the bus with other
modules in accordance with the relevant timing protocol given
by the test specification. (Interface modules include the interrupt
handler, interrupt requester, and arbiter.) The system
individually tests every module or joined modules of the unit.


A pattern generator stimulates the module(s) under test
through the bus, while a logic analyzer detects anomalous (or
illegal) events on the bus (see Figure 5). The stimulating patterns
created on the bus by the pattern generator, as well as
the responses created on the bus by the module(s) under
test, constitute a bus operation.

The standard rules define the modules and their options,
or capabilities. Modules can participate in different bus cycles
according to their defined capabilities. For example, a VME
master module with 16-bit data transfer capabilities may participate
in single- and double-byte read-and-write data transfers,
as well as block transfers. But it may not participate in
Byte (0-3) data transfers.

A module option may also define a cycle option. A VME
arbiter module may resolve a request according to a standard-specific
algorithm (single, priority, or round robin) and
a VME requester may release the bus according to one of
two algorithms (release-when-done or release-on-request).
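To make the difference between the priority and round-robin options concrete, here is a simplified sketch. It assumes four bus request levels numbered 0 to 3 with level 3 the highest, and it models only the choice of the next grant, not the bus signals; the function names are hypothetical.

```python
def priority_arbiter(requests):
    """Grant the highest-numbered pending request level (3 = highest)."""
    return max(requests) if requests else None


def round_robin_arbiter(requests, last_granted, levels=4):
    """Grant the next pending level below the last one granted.

    The search wraps around, so the level granted last is tried last.
    """
    for step in range(1, levels + 1):
        candidate = (last_granted - step) % levels
        if candidate in requests:
            return candidate
    return None


pending = {0, 2}
print(priority_arbiter(pending))        # level 2 always wins
print(round_robin_arbiter(pending, 2))  # level 0 gets its turn
```

With the same pending requests the two policies give different grants, which is why the arbiter option has to be declared before the arbitration cycles can be judged.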

Bus operation. Every bus operation is specified as a set
of bus cycles and hence as a set of timing and data protocol
test rules. Part of the bus operation stimulates the test module.
The other part is the response of the unit, which will be
analyzed by the logic analyzer.

For example, a CPU module reads data and writes it to a
memory board. First it runs an arbitration cycle to gain the
bus. Next it runs a read cycle to fetch the data from the
module. To write its data to the module, it performs a write
cycle. Finally, the CPU module releases the bus. In this example
the bus operation consists of the arbitration, read, and
write-and-release cycles. In every cycle the CPU module
(stimulus) gives signals that are driven by the memory module
(response).

Figure 3. Mechanical test.

Figure 4. Electrical test.

Figure 5. Timing protocol test: test of masters.

Pattern generation. The signal relations, into which timing
rules are translated, define the limits of the stimulating
rules. For worst-case timing protocol testing, we developed
worst-case stimulating patterns. For basic interconnection timing
protocol testing, the system stimulates the unit using more
tolerant limits.

A pattern generation program reads signal relations associated
with a bus cycle as input information and translates
them to stimulating patterns and wait statements for the pattern
generator. Data protocol test rules are taken into account
by the pattern generation program, as needed.
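The translation step can be pictured with a small sketch. The input format, a list of (signal, level, delay in nanoseconds) tuples, and the output step names WAIT and PATTERN are assumptions made for illustration; the article does not describe the real program's formats.

```python
def generate_patterns(relations):
    """Translate (signal, level, delay_ns) relations into generator steps.

    Each relation is read as: wait delay_ns, then drive signal to level.
    The result is a list of WAIT and PATTERN steps (hypothetical format).
    """
    steps = []
    state = {}
    for signal, level, delay_ns in relations:
        if delay_ns > 0:
            steps.append(("WAIT", delay_ns))
        state[signal] = level
        # Emit a snapshot of all driven lines as one stimulating pattern.
        steps.append(("PATTERN", dict(state)))
    return steps


# Example with made-up timing: assert AS, then DS0 35 ns later.
for step in generate_patterns([("AS", 0, 0), ("DS0", 0, 35)]):
    print(step)
```

Tightening or relaxing the delay values in the input is all it takes to switch between worst-case and more tolerant stimulation in this model.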

Real-time analysis. The real-time analysis method tests 
behavioral characteristics (or, functional tests) and verifies 
the relevant rules. Real-time analysis facilitates exhaustive 
testing and supports the BICT philosophy: 


Many complete bus cycles/operations run consecutively.
The logic analyzer monitors the signals on the
bus in real time and only freezes data in its memory if a
trigger condition is fulfilled. While cycles run, the controller
downloads different trigger words to the logic
analyzer.

We decided to run the bus cycles/operations several times
because 1) asynchronous behavior leads to deviations in reaction
times, and 2) different states of the unit may cause
different reactions. This method uses anomaly triggers, so
that only in cases where the standard is violated does the
logic analyzer upload the data to the controller for the test
report.
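The idea of an anomaly trigger can be sketched in a few lines. Here each cycle is reduced to a single observed reaction time, and the 60-ns limit is an invented value; a real logic analyzer evaluates trigger words against bus signals in hardware.

```python
def run_with_anomaly_trigger(cycles, is_violation):
    """Monitor every cycle but keep only those that violate a rule."""
    captured = []
    for index, cycle in enumerate(cycles):
        if is_violation(cycle):              # trigger condition fulfilled
            captured.append((index, cycle))  # freeze this cycle for upload
    return captured


# Reaction times in ns from repeated runs; the limit is hypothetical.
reaction_times = [20, 25, 61, 30]
print(run_with_anomaly_trigger(reaction_times, lambda t: t > 60))
```

Because conforming cycles are discarded, the amount of data sent to the controller depends on the number of violations rather than on how often the cycles are repeated.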

Control-and-status register test 

The control-and-status register set holds system information
in a standardized way. During a system startup, the configuration
program uses it for systemwide initialization,
configuration, and diagnostics.

The test for Multibus II checks the content, structure, and 
predefined functions of the control-and-status register. Since 
almost every item tested is a different case, we established an 
individual test procedure for each rule. During the test these 
procedures are executed one by one, according to the unit’s 
capabilities. 

Message-passing protocol test 

The message-passing protocol describes a communication
and data exchange procedure between agents using well-defined
messages. We call it “network in a crate” because it
applies mechanisms similar to those used in LANs. One message
packet is transferred using one sequential data transfer
bus operation in the message address space. In general, we
define two types of messages.

• Unsolicited. An unsolicited message is a one-way transfer
of a data packet to a destination address. The packet
carries up to 28 bytes of data. Systems use unsolicited
messages for interprocessor communication, very short
data transfers, and virtual interrupts. Within one Multibus II
system, we can define up to 255 message (interrupt)
sources and 255 message (interrupt) destinations.
The broadcast message, a special unsolicited message,
is a simultaneous, one-way transfer of an unsolicited
message to all available destination addresses in a
system.

• Solicited. A solicited message transfer is a specified, finite
exchange of data packets that transfers up to 16
Mbytes of data. Before such a message is sent, the system
negotiates to ensure there is enough free data buffer
available at the destination and to define other transfer
parameters. The solicited message fragments into small
data packets, each containing at least 32 bytes. The system
then sends them over the bus, one at a time.
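Fragmentation of a solicited message can be illustrated with a short sketch. The packet payload size is left as a parameter, and the value of 32 used in the example is only a sample figure; packet headers and the negotiation phase are not modeled.

```python
def fragment(data, payload_size):
    """Split a message body into consecutive packets of payload_size bytes.

    The last packet may be shorter. payload_size is an assumed parameter.
    """
    if payload_size <= 0:
        raise ValueError("payload_size must be positive")
    return [data[i:i + payload_size] for i in range(0, len(data), payload_size)]


message = bytes(100)                 # a 100-byte message body of zeros
packets = fragment(message, 32)
print(len(packets), [len(p) for p in packets])
```

Joining the packets in order restores the original body, which is the property a receiver relies on when the packets arrive one at a time.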

The message-passing protocol test for Multibus II checks
that the unit performs the sending and receiving of unsolicited
and solicited messages correctly. For solicited messages, it
also tests the negotiation of the transfer and the correct application
of the negotiated parameters. In addition, the content
of each message is checked.

Since the message-passing protocol uses a bus protocol on 
signal level, the unit must pass the electrical and functional 
test (on signal level) successfully before operators perform 
the message-passing protocol test. 

To perform this test, the vendor installs an upper tester
which substitutes for the software layer above the unit. Figure 6
shows that a test routine runs through a communication
channel between the controller and the upper tester.



Figure 6. Message-passing protocol test. 


BICT divides the message-passing test into two entities, 
each applying different analysis methods. 

• Messages sent. The controller uses two methods to analyze
outgoing messages. In most cases, the system uses
a software analysis method. The Multibus II interface
receives the unit’s messages, which are then analyzed
by the controller. But in some cases, in which the logic
analyzer must monitor the bus, the system uses a bus
analysis method. This method measures, for example,
the duty cycle during a solicited message transfer, or
detects unexpected messages sent by the unit.

• Messages received. Two methods can be used to test the
message-receiving capability of a unit. In the on-board
message verification method the upper tester analyzes
the messages received by the unit and reports the test
result to the controller. Since this enlarges the upper
tester and with it the vendor’s investment in installing it,
we prefer the message reflection method. Here, the messages
sent to the unit are reflected by the unit and analyzed
by the test system.

To ensure that detected errors are caused by the unit’s receiving
capability and not by its sending capability, the unit
must pass the sending capability before the test officer applies
the message reflection method.

Test example 

To demonstrate how the conformance test is carried out in
practice, we describe here a message-passing protocol test
case for Multibus II. Rule 13.5.1.2, Section 5 of the Multibus II
test specification states, “If an agent is not able to receive a
requested solicited message, it must send a buffer reject message
to the agent that has sent the buffer request message.”
This rule checks the negotiation of a solicited-message transfer.
The buffer reject message is an unsolicited message that
sets up a solicited-message transfer. A system refuses the
buffer request if, for example, a receiver does not have enough
resources or the resources are, at that moment, not available.

Verification procedure. To test this rule, we apply the
software analysis method. At the beginning of the verification
procedure the controller, via the communication channel,
starts a test routine of the upper tester installed on the
unit. The selected test routine is responsible for receiving
solicited messages.

Then the controller, via the Multibus II interface, sends a
buffer request message to the unit (see Figure 7). In this case
the requested length of the solicited message, 10 Mbytes, is
greater than the available local resources of the unit, 1 Mbyte.
The size of the unit’s local resource has already been declared
on the BICS/BIXIT form, so the controller can automatically
calculate the requested length of the message.

Now the controller waits until the Multibus II interface receives
a message. If no message arrives within a specific time,
the test of the rule has failed. If a message arrives, the
controller verifies whether the message is a buffer reject.

Results. At the end of the verification procedure the
controller interprets the results and appends them to a log file.
In the case of rule 13.5.1.2, Section 5, it assigns a “failed”
result if the unit sends no message or a message that is not a
buffer reject.
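The verification logic for this rule can be summarized in a short sketch. The callables send_request and receive are hypothetical stand-ins for the Multibus II interface, the message types are represented as plain strings, and the factor of ten mirrors the 10-Mbyte request against a 1-Mbyte resource in the example; none of these names come from the test system itself.

```python
def verify_buffer_reject(send_request, receive, local_resource, timeout_s=1.0):
    """Check that an oversized buffer request is answered by a buffer reject.

    send_request(length) and receive(timeout_s) are hypothetical callables;
    receive returns a message type string, or None on time-out.
    """
    requested = local_resource * 10   # ask for more than the unit can hold
    send_request(requested)
    message = receive(timeout_s)
    if message is None:
        return "failed: no message received"
    if message != "buffer reject":
        return f"failed: {message} message received"
    return "passed"


# Stub unit that wrongly grants the request.
print(verify_buffer_reject(lambda n: None, lambda t: "buffer grant", 1 << 20))
```

Passing the interface in as callables keeps the rule logic separate from the hardware, so the same check can be exercised against stubs.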

Test report. After the test, the center prepares a report
(see Figure 8) and sends it to the customer. The report describes
the unit, the test suite, and all useful information concerning
the executed tests and their results. The report’s first section
summarizes the number of tests applied and failed. A later
section details the applied tests and the test sequence. If a
rule was violated, the report notes the error and gives the
corresponding rule text. A description of the test method applied
can be found in the test specification. Customers can obtain more
detailed data on the test execution and results, such as captured
waveforms or data, from the test laboratory if needed for
further analysis.

The BICT system offers vendors the security of an impartial
test of a board’s conformity to bus standards. Conformity is a
necessary precondition for the interoperability of boards
sharing a bus.

To test a product, a center needs information only about the
features and parameters of the board, not design information
such as schematics, PROM, or PLA data. Design information stays
with the originator.

The mainly automated test procedures 
ensure cost- and time-effectiveness and 
guarantee objective results. A basic aim 
of the BICT project was to eliminate as 
much human influence as possible. The 
results depend only on the strengths and 
weaknesses of the test system. We are 
optimistic that any remaining weaknesses 
will be eliminated as the test system 
matures. 


Test controller        Unit under test (with 1-Mbyte memory)



Message passing protocol test
The following errors were identified:

                    Start               End
Rule number     Date     Time       Date     Time       Result
13.4-1          910522   083910     910522   083910     passed
13.5.1.2-4      910522   083910     910522   083911     passed
13.5.1.2-5      910522   083956     910522   083957     failed
    *** buffer grant message received from UUT
13.5.1.2-7      910522   084002     910522   084003     passed

Details of failed tests

The following test cases failed:

Rule 13.5.1.2-5

If an agent is not able to receive a requested solicited message, it must send a
buffer reject message to the agent that has sent the buffer request message.


Figure 8. BICT report section. 
BICT


Starting this year, the service will be available at the test
centers involved in the BICT project: the VDE Test and Certification
Institute in Offenbach, Germany; S. Seferiades and
Associates (SSA) in Athens; and Istituto Italiano del Marchio
di Qualita (IMQ) in Milan.
How I spent my Christmas vacation


My holiday tradition is to attack the big
projects that I’ve been putting off all
year. This year I plunged into the
Macintosh.

I have a Mac SE/30. I use it every day in my
consulting work. I also use it for reviewing the
Macintosh software that I receive in abundance.
This is the source of the problem.

In my consulting work I do simple word processing.
I occasionally transmit text files over
phone lines. My Mac SE/30’s original memory
size of 2 Mbytes was more than adequate for
these tasks.

I enjoy reviewing new software, since the process
forces me to learn to use powerful new
tools that my natural inertia would otherwise
keep me from exploring. Unfortunately, new
software grows by leaps and bounds, and I can’t
keep up with it unless my computer does the
same. Thus, my project for the holidays was to
upgrade my Macintosh to 8 Mbytes.

Memory upgrade 

Looked at objectively, a Macintosh memory
upgrade is a pretty small project. It takes less
than half an hour, but somehow I managed to
put it off for nearly four months after I had everything
I needed in hand. I probably would
never have done it at all if it had not been for
the idiot-proof package put together by the mail-order
company called Mac Warehouse.

I called Mac Warehouse’s toll-free number
(800) 255-6227 to order the 1-Mbyte single in-line
memory modules (SIMMs) required to upgrade
a Macintosh. As I write this, their advertised
price is $59 each. I needed eight of them to upgrade
from 2 Mbytes to 8 Mbytes. Each one replaces
one of the original eight 256-kilobyte
SIMMs. The Mac Warehouse people told me I


0272-1732/92/0200-0065$03.00 © 1992 IEEE 


needed a simple toolkit, including a grounding 
wrist strap, which they sold me for $9. They 
included a video cassette demonstration of the 
upgrade process. The entire package showed 
up at my house the next day. Then it sat in my 
office for more than three months. 

The video is marvelous. I am about as
unmechanical as a graduate of an engineering
school could possibly be. Removing the SE/30’s
motherboard would have been beyond me with
only written instructions. A woman with a reassuring
manner demonstrated all of the simple
but hard-to-explain-in-print steps required to
upgrade any Mac’s memory. With exquisitely
manicured hands she deftly manipulated tools,
eased wire harnesses out of hard-to-reach sockets,
and snipped a resistor lead. She carried on a
running description of what she was doing,
calmly mentioning the inherent dangers at certain
points. My favorite was when she said, “If
you crack the video tube, it will implode.”

I carried out the entire task in about a half
hour. I sat on my living room rug with the
grounding strap on my left wrist, the VCR remote
control unit by my right hand, and the TV,
my Macintosh, and my tools in front of me. The
video runs 11 minutes, but I kept replaying key
parts until I was sure I understood what I was
doing. When I reassembled my Mac, everything
worked perfectly.

Operating system upgrade 

I would never change my operating system
software if I didn’t have to. Unfortunately, new
software invariably makes old operating systems
obsolete. When I loaded the latest version of
Mathematica onto my Macintosh, it immediately
announced that it wouldn’t work on the version
of the Macintosh operating system that I was
running. I could have put in a minor
upgrade from 6.03 to 6.07, but I decided
to switch to System 7, Apple’s
long-awaited major upgrade.

The Berkeley Macintosh Users’
Group (BMUG) is a wonderful resource
for anyone who uses Macintosh computers.
When I decided to upgrade to
System 7, I called them at (310) 549-BMUG.
I bought their package deal,
which includes the 10-disk upgrade set
from Apple and The Little System 7 Book
by Kay Yarborough Nelson (Peachpit
Press, Berkeley, 1991, 158 pp., $12.95),
all for $25. I did this about the same
time I bought the SIMMs, and everything
sat around my office for just about
as long.

System 7 is quite different from previous
Macintosh operating systems, but
by the time I sat down to use it I felt
right at home. I attribute this to the
quality of the upgrade package that
Apple put together, the excellence of
The Little System 7 Book, and the many
informative articles about System 7 in
the BMUG Newsletter.

Users have come to expect installation
programs to guide them through
the installation of large application
packages. While installation programs
have certainly improved considerably
over the last few years, Apple’s upgrade
software is a cut above any other that
I’ve seen. It is a model for others to
follow. Especially impressive were two
HyperCard stacks designed to familiarize
users with the new features of System 7.
The one dealing with the new
networking features is especially impressive.
It simulates the behavior of
the system, allowing you to make menu
selections, click buttons, and so on, just
as you would if you were performing
the networking functions it is teaching
you.

While my move to System 7 was
easy, it also had its rough spots, all of
which hinge on compatibility. Because
the underlying software of System 7
differs substantially from previous versions,
many software packages designed
for previous versions will not
run. The Apple upgrade program begins
by printing a list of all of the applications
on your machine. It labels
each one as “fully compatible,” “mostly
compatible,” “must upgrade,” or “information
not available.” Unfortunately,
the last two categories predominated
on my machine. The only ones listed
as fully compatible were Word, Excel,
Mathematica, and MacDraw II.


System 7 differs 
from previous 
Mac operating 
systems; but I felt 
right at home 
with it. 


The compatibility printout contains
helpful information on upgrading old
software. It lists your version number
and that of the first compatible version.
It gives the phone number of the manufacturer
of each package you need to
upgrade. In some cases it indicates that
the upgrade package actually includes
a compatible version. Unfortunately, it
appears to say that the upgrade package
includes a compatible version of
HyperCard, but this is only true if you
buy the System 7 Personal Upgrade Kit
from Apple. My BMUG package did
not contain HyperCard, and the
HyperCard version that came originally
with my SE/30 is incompatible with
System 7. So I had to scurry around to
obtain a working version.

While simple incompatibility with the
programming of System 7 is a problem
for most packages, some have an
even worse problem. System 7 makes
a large part of what they do obsolete.
The popular Suitcase II and Master Juggler
programs are prime examples.
Many of their capabilities for managing
fonts, sounds, and desk accessories
are no longer needed. System 7
handles sounds and fonts by allowing
you to drag their icons into or out of
the system file. It eliminates the distinction
between applications and desk
accessories altogether, since it now incorporates
the features of MultiFinder.
The apple menu simply lists the contents
of a folder in the system file. The
apple menu folder can contain any
application, document, or folder. Selecting
an item from the apple menu
causes the item to be opened.

Another important feature of System
7 makes front-end programs like Power
Station less important. System 7 allows
you to create an alias for any document,
application, or folder. The alias
consumes a tiny amount of disk space
and serves as a pointer to the actual
item. Thus, for example, you can leave
Excel or Word in its own folder but
place an alias for each in the apple
menu folder or the start-up folder. You
can make aliases for the apple menu
folder and the system file and leave
the aliases on the desktop. I made an
alias for the text file that contains this
column and placed it in the apple menu
folder. When I select “Column” from
the apple menu, the system opens this
text file and the corresponding word
processor.

One important Power Station function
that I haven’t figured out how to
do in System 7 is to associate an icon
with an application and a specific set
of documents to be opened with it. I
would certainly keep Power Station
under System 7 for its document-handling
capabilities alone, but unfortunately
my version caused the system
to crash when I tried to run it. The
Apple compatibility checker provided
no information about Power Station,
so this will take additional work on my
part. I encountered a number of other
small nuisances like that during the
upgrade, but on the whole I’m quite
happy.

As I write this column using Microsoft
Word running under System 7, I am
simultaneously also running Mathematica,
HyperCard, and the Variable
Symbols Mathematica Help Stack (see
Micro Review, Aug. 1991). I even have
another 250 Kbytes left over in my new
8-Mbyte memory. I’m impressed. I
wonder how long this euphoria will
last.

Books 

The holidays are also a nice time to
sit by the fire reading, so I managed to
look at a number of good books. One
of the best of these (see next paragraph)
is a little old for a computer
book, and it does show its age in
places, but it is so good that it’s worth
reading anyway. I hope the publishers
bring out an updated version.

The Sachertorte Algorithm and
Other Antidotes to Computer Anxiety,
John Shore (Viking, New York,
1985, 286 pp., $16.95)

Shore states his purpose to be promotion
of a general understanding of
computers. As a reflective and witty old
hacker, he is well qualified to achieve
that purpose. It looks to me as though
he has done so, although I can’t look
at the book through the eyes of the
intended audience. All I can say is that
again and again he makes points that
I’d want to make to an interested and
intelligent beginner. It’s hard to communicate
the overall effect of this by
quoting isolated phrases, but here are
a few that appealed to me:

In general, jargon-filled error
messages are symptoms of a
basic problem, namely that the
designers and programmers of
many office and personal computer
systems have failed to
separate their own concerns
from those of the user.

If you buy a new car and
spend the next year having the
dealer fix things that didn’t
work right to begin with, you
don’t say that your car is being
maintained; you say you
bought a lemon. By this criterion,
most software products
are lemons.

While standing recently in a
cold shower, I had occasion to
think about replacing my hot
water heater. Because I had
been thinking about software
engineering before the water
turned cold, I was impressed
by the extent to which I could
choose a new water heater
without thinking about air conditioners,
telephones, windows,
or practically anything
else except the number of
people in the house and the
number of dollars in my bank
account.

The title of Shore’s book derives from
the extended example he gives of trying
to use his mother’s recipe for Aunt
Martl’s Sachertorte. He illustrates the
process of stepwise refinement as he
moves from his mother’s original instructions
(prepare ingredients; bake
cake) to a more detailed sequence
of steps that are all within his own
more limited repertoire of cooking
operations.

Shore wants his readers to understand
that programming computers is
an exercise in careful and precise communication.
The programmer must
communicate with the machine, but
even more importantly, the programmer
must communicate with other
human beings. In this sense, programming
is a literary activity. When seen
this way, some of the problems of the
“software crisis” are easier to understand.
Writing is easy, but writing well
is hard.

Shore also wants us to look at programming
as mathematics and as architecture.
Here he laments the fact that
millions of people are acquiring personal
computers and programming
them with the languages and techniques
of the sixties. The concepts of
abstract specification and information
hiding that were introduced in the early
seventies are only beginning to be
widely used today. Proofs of correctness,
long recognized to be theoretically
possible and highly desirable, are
still no more than classroom exercises.

The tools of the writer, the mathematician,
and the architect are the keys
to managing the complexity of large
software packages. The failure to manage
complexity, in Shore’s view, is the
cause of our current software crisis.
Shore does a good job of explaining
and illustrating the problems of complexity
and the failure of our tools in
terms that should be accessible to readers
of all backgrounds.

If you see this book in your local 
bookstore, buy a copy to give to a 
nontechnical friend. You might enjoy 
reading it yourself first. 

Mathematica: A Practical Approach,
Nancy Blachman (Prentice
Hall, Englewood Cliffs, N.J., 1992, 380
pp., $30)

I wanted to install 8 Mbytes of
memory in my Macintosh so I could
run Mathematica comfortably. Unfortunately,
inadequate memory is not the
only obstacle to using Mathematica.
Mathematica is a large program with
many capabilities, and before
Blachman’s book we had no adequately
tutorial introduction to it.
Stephen Wolfram, the creator of
Mathematica, wrote a book that is supplied
with the program, but as
Blachman points out, learning
Mathematica from that book is like
learning English from a dictionary.

In my August 1991 column, I discussed
some of the work Nancy
Blachman and her company, Variable
Symbols, have done to teach people
to use Mathematica. She based the current
book on undergraduate courses
she taught at Stanford University. It
draws on her extensive experience with
the difficulties people encounter in
learning Mathematica.

I’m impressed by the quality of this
book. Blachman provided Prentice Hall
with camera-ready copy, so she can
take credit for its attractive appearance
and its excellent editing. Mathematica
can also take some credit, since the
book began as a Mathematica
notebook, and many of the illustrations
were originally done using Mathematica.

Learning Mathematica is not my top 
priority, so working my way through 
this book will be a background project 
for a while. In my next column I’ll tell 
you how it’s going. So far, I like 
Blachman’s approach. 

The Parents Guide to Educational
Software, Marion Blank and Laura
Berlin (Microsoft Press, Redmond,
Wash., 1991, 424 pp., $14.95)

The authors of this book are an edu¬ 
cational psychologist and a develop¬ 
mental psychologist. Their intended 
readers are parents who want to give 
their children the advantages of com¬ 
puter-assisted education but don’t 
know what to do next. As the authors 
point out, 

Currently you can choose from 
over 10,000 programs, cover¬ 
ing almost every subject 
imaginable. And, as always, ex¬ 
cellence is rare. If you’re like 
most parents, you probably 
don’t see this array of options 
as a pool of resources but as a 
mire of confusion. 

The authors have selected more than 
200 programs for careful rating. They 
have applied several criteria to this se¬ 
lection, and some of these criteria dif¬ 
fer from those used in selecting 
programs to be used in a school set¬ 
ting. Basically, all of the programs in 
the book are of high quality, attractive 
to children of the intended age group, 
sufficiently varied that children won’t 
tire of them immediately, and inexpen¬ 
sive enough for many families to con¬ 
sider buying them. Where possible the 
authors tried to select programs that 
are helpful to the 10-15 percent of the 
students who have learning difficulties
in specific learning areas. 

Most of the book is devoted to thor¬ 
ough reviews of the selected programs, 
but the authors begin with helpful in¬ 
troductory material. They explain how 
computers can help with learning. Then 
they review the major outlines of the 
elementary school curriculum, focus¬ 
ing on what skills are required by each 
area of study. Finally, they talk about 
how to set up an educational computer 
center at home and how to help your 
child use the programs that you 
acquire. 

The reviews all follow a common 
format. The authors identify each pro¬ 
gram as being for a specific age range 
and school grade range. Then they 
summarize the hardware requirements, 
the abilities a child must have to use 
the program, and the curriculum areas 
that the program addresses. A detailed 
discussion of the program itself follows 
these short summaries. 

I have no experience with any of 
the programs the authors have in¬ 
cluded, so I can’t judge their reviews. 
However, the reviews that I read ad¬ 
dressed issues that I’d want to consider 
in buying educational software. One 
interesting example was a $14.95 pro¬ 
gram for children in the two-to-four age 
range. The child would not be able to 
start the program because of a compli¬ 
cated copy protection scheme, so the 
parent would always have to be there 
to start it. The authors emphasize this 
point without passing judgement. I 
admire their restraint. 

I think this book is a really good in¬ 
vestment for anyone considering buy¬ 
ing educational software. 

Annual phone list update 

My year end would not be complete 
without the annual overhaul of my 
phone list. This year I finally automated 
the process. 

Address Book Plus (Power Up Software
Corp., 2929 Campus Drive, San
Mateo, CA 94403, (415) 345-5900, 
$99.95) 


This program provides what most 
people need to handle their little black 
books. It lets you build a small data¬ 
base, define selection and sorting cri¬ 
teria, and generate reports in a variety 
of useful formats. These features are 
mostly built in, so that users don’t need 
to use query or report generation 
languages. 

The program supports printing for¬ 
mats corresponding to the popular 
pocket and desk appointment books. 
It also has a special Instabook format, 
which lets you print a stack of two- 
sided pages and turn them into a 
pocket-size address book, simply by 
cutting, folding, and stapling. I tried it, 
and it really works. 

The program offers an integrated 
telephone dialing facility. It seems to 
have all of the flexibility necessary to 
deal with prefixes, area codes, choice 
of long-distance carrier, international 
dialing, modem control, and so on. 

Power Up has gone to great lengths 
to make this program fit into your way 
of doing things. It can import or ex¬ 
port files in formats that you define, 
and it has built-in formats allowing it 
to import from HyperCard and to ex¬ 
port to personal organizers from Casio 
and Sharp. It also works with Power 
Up’s mail merge program (Letter Writer 
Plus). 

It’s taken me a long time to get 
around to using a program like this, 
because no program comes close to 
doing all the little things I do by hand, 
but this one is good enough. What it 
does for me outweighs the things I’ll 
have to give up. 
Understanding the Acronym Tower of Babel


I am often asked what a particular acronym
means, or where it came from. After
reflecting on these questions, I realized 
that they might be the meat for an interesting 
column. After all, we use these “shortened” words 
to convey large amounts of information within a
standard or other technical document.

Interestingly, many of the acronyms and defi¬ 
nitions used today came about because of the 
telegraph and Teletype era. Since telegraphers 
had to tap in each letter of a word sent, they 
naturally wanted to use the smallest words pos¬ 
sible. Consequently, a unique shorthand was cre¬ 
ated. This shorthand carried over to Teletype 
users, partly due to the slowness of the system 
and the difficulty of typing on the keyboard. Any 
of you that remember a KSR 35 will understand 
what I mean. 

Interestingly, World War II and the Korean 
conflict also contributed to the acronym pool. The 
idea was to send as much information as fast as 
possible so the enemy wouldn’t have a chance to 
locate the signal or decipher it. 

Another interesting note concerns the contributions
industry has made to the acronym Tower
of Babel. The airline industry, in days before so¬ 
phisticated reservation systems, used a form of 
shorthand called Fast Talk. This method let an 
agent in Chicago call an agent in Des Moines to 
reserve a seat through to Los Angeles. 

Even with the advent of computer systems and 
better communications than Teletype, the airline 
industry developed a worldwide standard abbre¬ 
viation system called SIPP (Standard Information 
Processing Protocols) codes. This system of codes 
allowed a reservations agent at United Airlines to 
send a great deal of easily understood informa¬ 
tion about a passenger to another carrier. For 
example, C-19 meant that the passenger required 
special handling. Other codes indicated special
food or an unpleasant passenger. 

Similarly, the airlines also had a system for 
identifying lost baggage. For example, 02 Blue 
indicated a blue, hard-sided Samsonite bag and 
03 denoted an American Tourister bag. Type 20, 
on the other hand, was a folding garment bag. I 
was one of the developers of United Airlines To¬ 
tal Apollo Baggage System (TABS) that made use 
of this information to automatically search for lost 
bags—a system that ultimately saved United sev¬ 
eral million dollars a year in mishandled baggage. 

Of course, NASA (the US National Aeronautics 
and Space Administration) probably gets the prize 
for developing acronyms. It isn’t unusual to pick 
up a NASA-developed document and find that it 
can’t be read without the help of an acronym 
glossary close at hand. NASA’s rationale is that so 
much information has to be conveyed that abbreviations,
acronyms, and initialisms (AA&I) enhance
the readability of the document.

Because of limited space, I’ve taken two ap¬ 
proaches in my acronym/definition list beginning 
on the next page. I’ve chosen some interesting 
acronyms and expounded upon them, and oth¬ 
ers I’ve just listed with their meaning. I could write 
volumes on this area alone or include multiple 
meanings, but I won’t. I covered lots of ground 
with this column. You can probably guess, how¬ 
ever, that I only touched the surface. If you have 
a favorite acronym that you would like to share 
and can tell a short (no more than two-paragraph) 
story, I’ll include it in forthcoming Micro Stan¬ 
dards columns. 

Finally, some parting thoughts: Didja [sic] know
that Soroc, the now-defunct spin-off terminal 
manufacturer from Lear Siegler, used an anagram 
for its name? Take a close look at Soroc, and you’ll 

continued on p. 72 
Glossary

Ack: acknowledgment, an old
communications term that was used
with Teletype systems. The telegrapher
on the receiving end would indicate
that a connection was made,
or a message received, either by
sending AK (later ACK) or their initials.

ADC: analog-to-digital conversion.

AIIM: Association for Information
and Image Management.

ANSI: American National Standards
Institute, the governing body
for the management and creation of
standards (see IEEE, MSC, SPARC,
and TCMM).

AT: Advanced Technology, an
IBM-fostered term describing the basis
of the PC bus architecture. This
architecture built on the 8-bit and 16-bit
PC and XT (Extended Technology)
buses of IBM’s earlier machines.

Ata: AT attachment.

BER: bit error rate, an important
metric when referring to storage or
communications devices, which
gives the number of bits at fault during
various types of measures of the
device.

BSR: Board of Standards Review.

Bsy: busy. Again, this is a term
from telegraphy days. When queried,
the downline telegrapher would respond
with a busy reply, requesting
a hold on traffic. Later with Teletypes,
a busy signal was sent out much like
the busy signal on your telephone
being high, thus denoting busy, or
setting of a “busy” bit in a UART, for
example.

CAM: common access method/
content-addressable memory.

CBEMA: Computer Business
Equipment Manufacturers Association.

CISC: complex instruction-set
computer, a class of processor that
has a large number of instructions to
choose from. Examples are the Intel
80386 and the Motorola 68040. (See RISC.)

CKD: count-key-data. 

CR: connection resource. 

CRC: cyclic redundancy check. 

CS: continuous servo. 

DADI: Directly Addressable Device 
Interface. 

DASD: direct access storage device, a 
disk or other secondary storage device 
that permits access to a specific sector or 
block of data without first requiring the 
reading of the blocks that precede it. 

DMA: direct memory access, transfer 
of data between the memory and a pe¬ 
ripheral (or another memory) without in¬ 
tervention of the CPU (besides DMA 
controller initialization). 

DOTS: digital optical tape system. 

DTA: dedicated test article. Various 
test units may be used for various purposes
in a test environment. A DTA is
specifically dedicated to test purposes 
and as such may have special test ports 
and special power inputs, and the cabi¬ 
nets may or may not exist. 

DUT: device under test. This refers to 
the actual hardware that is attached to 
test equipment (see UUT). 

ECMA: European Computer Manufac¬ 
turers Association. 

EIA: Electronic Industries Association. 

EISA: Extended Industry Standard Ar¬ 
chitecture. 

Escon: Enterprise Systems Connec¬ 
tion. 

ESD: electrostatic discharge. 

ESDI: Enhanced Small Disk Interface. 
This is the analog to SCSI. This interface, 
which was developed by Maxtor, pro¬ 
vides a fast channel for disk drives. Origi¬ 
nally, the “D” meant device, but the 
developers changed their aspirations 
and decided that a fast interface for disks 
was good enough. IBM did choose ESDI 
for early PS/2 models but is now mov¬ 
ing exclusively to SCSI. This is one of 
the many interfaces that I. Dal Allan of 
ENDL Consulting, Saratoga, Calif., can 
take credit for. 


FAT: file allocation table, a table 
that contains mappings of the physi¬ 
cal locations of all of the clusters in 
all files on disk storage. 

FBA: fixed-block architecture.

FC: fiber channel. This is techni¬ 
cally an initialism and is a bad 
choice—writing out fiber channel 
makes more sense than using the 
initials. 

fci: flux changes per inch. 

FDDI: Fiber Distributed Data Inter¬ 
face, protocol description for optical 
networks and copper coaxial cable 
networks. 

FEU: functional equivalent unit, a 
test unit equivalent to the actual fin¬ 
ished device. As such, it has the same 
appearance as the finished article and 
operates in accordance with the func¬ 
tional requirements. 

FFS: flat-file system. 

FOM: figure of merit, another 
poorly thought-out use of acronyms. 
A figure of merit is a number that is 
determined from some weighting 
system that has preestablished 
boundaries. 

fps: frames per second. 

FPT: Forced Perfect Termination. 
This is a nifty trick developed by IBM 
to create a termination scheme that 
follows the changes in the impedance 
of a line. Theoretically, it implies an 
infinite length without attenuation of 
the signal. However, physics does get 
in the way, and the signal will die at 
some point. 

FRAM: ferroelectric memory.

Gbsi: gigabits per square inch. 

HBA: host bus adapter. (This is 
SCSI talk.) 

HIPPI: High-Performance Parallel 
Interface. 

HIPPI-FP: Framing protocol. 

HIPPI-LE: IEEE Std 802.2 link en¬ 
capsulation. 

HIPPI-PH: physical layer. 

HSC: hierarchical storage controller. 

IDC: insulation displacement connectors.

IEC: International Electrotechnical
Commission.

IEEE: Institute of Electrical and
Electronics Engineers.

IIST: Institute for Information Storage
Technology.

IOPS: input/output (requests) per
second.

IPI: Intelligent Peripheral Interface,
an interconnection standard that 
grew out of the Intelligent System 
Interface developed by ISS Sperry 
Univac. This interface is designed for 
large systems that have very high 
speed I/O channels. 

JISC: Japan Industrial Standards 
Commission. 

JTC1: Joint Technical Committee 1.

Lun: logical unit. 

MCAV: modified constant angular 
velocity, a method used on compact 
disks and some read/write optical 
disks. 

Mflops: millions of floating-point 
operations per second, a metric used 
to benchmark the overall processing 
capability of a microprocessor and 
math coprocessor. 

MIG: metal-in-gap, a specialized 
read/write transducer used in
high-performance disk drives.

MIPS: millions of instructions per 
second, a measure of a processor’s 
horsepower. It is usually compared 
with a known system such as a Digital
Equipment Corp. VAX-11/780 that
has a MIPS rating of 1.

MO: magneto-optical, a technique 
of combining magnetic recording 
with optical recording to produce a 
high-capacity read/write device. 
Geoffrey Bate, a Santa Clara Univer¬ 
sity fellow, pioneered this technique 
while at Verbatim Corp. 

MR: magneto-resistive, a single-pole
read/write transducer used in
high-performance and high-bit-density
disk drives.

MSC: Microcomputer Standards Com¬ 
mittee, the governing body for bus 
standards. 

MTBF: mean time between failures. 

MTBS: mean time between stops. 
This has nothing to do with the rapid 
transit system in your town; rather, it is 
a measure used with rotating memory 
systems. 

MTDL: mean time to data loss. This 
is my favorite acronym. Apparently, it 
describes what we all fear most (by the 
way, I saved my work at this point). 
That’s the loss of all the work we have
done up to the deadline because of a
power failure, a cubicle mate spilling
coffee down the grillwork of your terminal,
or the gnomes who control these
things deciding it’s your turn in the
barrel.

MTSR: mean time to service repair.

MTTR: mean time to repair. 

NIC: newly industrialized country. 
This just shows how silly we get with 
acronyms, but it is actually used. 

NRE: nonrecurring engineering. 

NRZ: nonreturn to zero, a recording 
method that uses the zero crossing as a 
reference point. 

NSIC: National Storage Industry Con¬ 
sortium. 

NWI: new work items. 

OS, O/S: operating system.

OSF: Open Software Foundation.

PCD: printed circuit disk. 

POH: power-on hours. 

PRML: partial response maximum 
likelihood. 

Prej: port reject. 

RISC: reduced instruction-set com¬ 
puter, a fast processor that has fewer 
instructions than a CISC and is defined 
as performing a register-to-memory 
move in one t (machine-cycle time). 

QIC: quarter-inch cartridge; also the 
name of the tape committee, chaired 
by Raymond Freeman, Freeman Assoc., 
Santa Barbara, Calif., which developed 
the quarter-inch format and recording 
standards. 


Rej: reject. Most people believe the 
general use of this term comes from 
the reject switch on phonograph 
record players that were marked “Rej”
on the top of the knob. 

RLL: run-length limited, which defines
a code combining 1s and 0s that
is a mathematical permutation allow¬ 
ing more information to be encoded 
on one transition. Notably, RLL is 
used in data recording and transmis¬ 
sion. 

R/W: read/write. 

SCSI: Small Computer Systems In¬ 
terface, one of the most important in¬ 
terfaces in the industry. SCSI grew out 
of the Shugart Associates System In¬ 
terface (SASI). The now-defunct 
Shugart took the idea to ANSI under 
the guidance of I. Dal Allan (the father 
of modern-day interfaces), and the re¬ 
sult was SCSI-1. Currently, SCSI-II—a 
robust 16-bit version with 32-bit capa¬ 
bility—is expected to be deemed an 
ANSI standard soon. 

SD3: project approval request 
(ANSI). 

SPARC: Standards Planning and 
Requirements Committee, the group 
you address when preparing a new 
standard; also a microprocessor archi¬ 
tecture. 

SPOOL: simultaneous peripheral 
operations on line, when a high¬ 
speed device like a disk is interposed 
between a running program and a 
low-speed device such as a printer. 

SSD: solid-state disk. 

SWG: Special working group. 

SSWG: Specific subject working 
group. 

TAG: Technical Advisory Group. 

TC: technical committee. 

TCMM: Technical Committee on 
Microcomputers and Microproces¬ 
sors. 

TFH: thin-film head, a special read/write
transducer that is made using
semiconductor techniques. This type
of head permits smaller geometries,
thus more tracks per inch and greater
density of storage.

TFM: thin-film media, a special recording
media developed by using a
sputtering process to lay down the
thinnest possible layer of recording
material to improve permeability and
susceptibility to the recording
process. This technique is used for
optical recording, especially thermal-magneto
recording (TMR) that uses both
optical and magnetic methods.

tpi: tracks per inch.

TPS: transaction processing system.

UART: Universal Asynchronous Receiver
Transmitter.

ULP: Upper Layer Protocol. You can
tell this came from technologists who
deal with systems and storage devices.
Communications people have their own
jargon, and they would tend to point out
the layer in the seven-layer OSI
model.

UUT: unit under test. (See DUT.) 
UUT is similar to a DUT depending on 
whom you talk to. Sometimes, UUT 
refers to software under test, while 
DUT refers primarily to hardware. 

WORM: write-once, read-many, an 
optical device that laid the foundation 
for the optical systems being used 
today. 
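
Several glossary entries name metrics that reduce to simple arithmetic. As one illustration, the bit error rate (BER) entry can be sketched in Python as the number of errored bits divided by the number of bits compared. The function name and the sample byte values are invented for this sketch, and real test equipment measures over far longer runs.

```python
# Bit error rate: errored bits divided by total bits compared.
# The function name and sample data are illustrative assumptions.
def bit_error_rate(sent: bytes, received: bytes) -> float:
    if len(sent) != len(received):
        raise ValueError("both streams must have the same length")
    if not sent:
        raise ValueError("streams must not be empty")
    # XOR leaves a 1 wherever the two streams disagree; count those bits.
    errors = sum(bin(a ^ b).count("1") for a, b in zip(sent, received))
    return errors / (8 * len(sent))


sent = bytes([0b10101010, 0b11110000])
received = bytes([0b10101011, 0b11110000])  # one flipped bit in 16
ber = bit_error_rate(sent, received)
```

Here one bit in sixteen differs, so the computed rate is 1/16.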


notice that it spells “Coors”; even their
logo was the top of a beer can.

Remember NBI, the fast-rising word
processor company of the seventies
and early eighties? NBI stood for “nothing
but initials.”

And, one of the more interesting acronym
histories belongs to EDN magazine.
Most people erroneously think
EDN stands for “Electronic Design
News.” About 40 years ago when EDN
started, the name meant “Electrical Design
News.” However, the magazine
became generically known as EDN, and
the publishers decided that would be its
official name. But H. Victor Daimm, the
publisher during the seventies and
eighties, felt EDN stood for “everything
a designer needs,” which shows that an
acronym never outshines brilliance.

Want to send me your acronym
story? Mail it to me at the address shown
earlier; I’ll be happy to receive it.


Micro Law


continued from p. 6
prepare any response to your first.

It is the policy of Bit Bucket Consulting
Services, Ltd. not to have its
personnel enter into confidentiality
relationships regarding pending litigation
until the engineer and the company
are sure they can appropriately
support the position of the party for
which the consulting/expert witness
services are to be performed. I tried to
convey this during our telephone
conference on November 14, and
thought that I did. But apparently I was
not successful in properly communicating
that position. I regret any confusion
that may have been caused by
failure of communication.

Accordingly, I have not read the
materials enclosed in your letters and
will not do so unless and until you advise
me that I can do so without creating
any mutual problems. I will study
the two SIMM patents and any other
documents of public record that you feel
I can appropriately examine in the light
of the foregoing policy of Bit Bucket
Consulting Services, Ltd. Then I will attempt
to reach a preliminary decision
whether I feel that the patents are valid
and whether I can otherwise support the
basis of your claim against Toshiba and
NEC as to SIMMs infringing your two
patents. I will check with our company
records to make sure no conflict exists
and will promptly get back to you.

If there is any problem with this,
please advise promptly. Thank you for
your interest in having us assist you.

Sincerely yours,

Bit Bucket Consulting Services, Ltd.

By John Q. Kludge

This is not perfect, but probably will
work. For one thing, because you cannot
claim not to have seen the enclosures,
it may be argued that you must
have read them. I don’t know what you
can do about this, unless a lawyer or
clean-room service reads and screens
all your mail for you. I welcome suggested
improvements from readers. In
any event, those of you who want to
consult or serve as expert witnesses
now have an interesting new problem
to worry about.
Firmware standards


[The P1275 working group is developing a proposed
standard for boot firmware based on the
machine-independent Open Boot firmware.
Mitch Bradley discusses Open Boot’s design and
benefits of standardization.

I invite readers to send information on a tool
or method that solves problems, for consideration
in future columns. -C.W.]


Mitch Bradley 
Sun Microsystems 


Firmware is the ROM-based software that
controls a computer between the time it 
is turned on and the time the primary 
operating system takes control of the machine. 
Firmware’s responsibilities include testing and 
initializing the hardware, determining the hard¬ 
ware configuration, loading (or booting) the 
operating system, and providing interactive de¬ 
bugging facilities in case of faulty hardware or 
software. 

Historically, firmware designs have been pro¬ 
prietary and often specific to a particular bus or 
instruction set architecture (ISA). This need not 
be the case. Firmware can be designed to be 
machine-independent and easily portable to dif¬ 
ferent hardware. There is a strong analogy with 
operating systems in this respect. Prior to the 
advent of the portable Unix operating system in 
the mid-seventies, the prevailing wisdom was 
that operating systems must be heavily tuned to 
a particular computer system design and thus 
effectively proprietary to the vendor of that 
system. 
Standardization

Standardizing firmware would offer several 
advantages to designers and users, including 

• Consistency across different systems. It
is easier to learn one firmware command
set and use it across different systems from 
different vendors than to have to learn a 
different set of commands for every differ¬ 
ent system. 

• Avoiding duplication of effort. Without 
a standard, every computer manufacturer 
must reinvent the firmware wheel. This ef¬ 
fort is largely wasted. The availability of stan¬ 
dard firmware offers manufacturers the 
option of buying proven firmware off the 
shelf, rather than designing and building it 
from scratch. Porting is almost always 
cheaper than writing from scratch. The ex¬ 
istence of a standard also will raise the level 
at which value is added, allowing improve¬ 
ments to be made on top of a proven base, 
rather than continually redesigning the base. 

• Reducing design cycle times. Designers 
would have one less thing to build for each 
new system architecture if standardized firm¬ 
ware is available. 

• Good, not half-hearted, firmware. Since 
firmware is not visible to the average user 
and is rarely cited in marketing literature or 
measured in benchmarks or reviews, it is 
often treated as an afterthought or a neces¬ 
sary evil. Consequently, firmware design has 
not received the same level of attention as 
other, more visible software components. 
The requirements imposed on firmware by 
the desire to standardize it force us to take 
a hard look at many issues that might oth¬ 
erwise be swept under the rug. 
Among the many firmware solutions
available on different machines, Open 
Boot is the only one that has been pro¬ 
posed as a multivendor standard. 

Open Boot advantages 

We designed Open Boot firmware 
with open systems and standards in 
mind. We intended from the outset to 
port it to many different machines with 
widely varying bus structures, ISAs, and 
configurations. 

Plug-in drivers. A key Open Boot 
feature is support for self-identifying 
devices. Consider a computer with an 
open expansion bus, such as VMEbus 
or SBus. An independent board ven¬ 
dor (that is, not a system manufacturer) 
of a card that plugs into the bus wants 
the system to recognize and use that 
card. In an operating system environ¬ 
ment, this is easily accomplished. The 
board vendor supplies a driver on a 
diskette that can then be loaded onto 
a hard disk or installed into the oper¬ 
ating system. 

It is more difficult to support third- 
party devices in the firmware environ¬ 
ment because firmware operates before 
the system is ready to read the disk. 
Since it is difficult to merge third-party 
drivers into existing system ROMs, it is 
better to store a driver in a ROM on 
the card for the plug-in device to which 
it applies. Others have taken this ap¬ 
proach, but most existing firmware sys¬ 
tems store the driver in ISA-dependent 
machine language binary code, and 
thus it only works on computer sys¬ 
tems from a particular vendor. 

Open Boot also uses the plug-in 
driver technique. But instead of stor¬ 
ing those drivers in machine language, 
Open Boot uses FCode. FCode is a 
machine-independent, byte-coded in¬ 
termediate language for the Forth pro¬ 
gramming language. It is based on a 
stack-oriented virtual machine that may 
be easily and efficiently implemented 
into any computer. FCode drivers are 
incrementally compiled into system 
RAM for later execution. 
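
As a rough illustration of the idea, a byte-coded, stack-oriented virtual machine can be sketched in a few lines of Python. The opcode numbers and names below are invented for this sketch and are not the real FCode assignments. The sketch also interprets each byte directly, whereas the text above says FCode is compiled into RAM for later execution.

```python
# Toy byte-coded stack machine. The opcodes are hypothetical and chosen
# only for illustration; they are NOT the real FCode byte assignments.
OP_END, OP_LIT, OP_ADD, OP_MUL, OP_DUP = 0x00, 0x01, 0x02, 0x03, 0x04


def run(program: bytes) -> list:
    """Execute a byte-coded program and return the final data stack."""
    stack = []
    pc = 0  # index of the next byte to decode
    while pc < len(program):
        op = program[pc]
        pc += 1
        if op == OP_END:
            break
        if op == OP_LIT:  # push the following byte as a literal
            stack.append(program[pc])
            pc += 1
        elif op == OP_ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == OP_MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == OP_DUP:
            stack.append(stack[-1])
        else:
            raise ValueError(f"unknown opcode {op:#04x} at offset {pc - 1}")
    return stack


# Push 3 and 4, add them, duplicate the sum, and multiply: (3 + 4) squared.
PROGRAM = bytes([OP_LIT, 3, OP_LIT, 4, OP_ADD, OP_DUP, OP_MUL, OP_END])
final_stack = run(PROGRAM)
```

Because the program is just a sequence of bytes, the same driver image could in principle be carried on a card and executed on any host for which such an interpreter exists, which is the portability argument made above.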

In addition to its use for firmware
device drivers, FCode also provides a
descriptive capability. Plug-in device 
cards use it to report their characteris¬ 
tics to the firmware and system soft¬ 
ware. Such characteristics may include 
the device name, model, revision level, 
register locations, interrupt levels, sup¬ 
ported features, and any other identifi¬ 
cation information that makes sense for 
the particular device. System software 
may use this information to automati¬ 
cally configure itself for correct opera¬ 
tion with particular devices. 

Interactive debuggers. Open Boot 
uses the same runtime system that executes
FCode drivers as the basis for
an interactive Forth language inter¬ 
preter. This interpreter can be used as 
a programmable debugger, allowing 
developers, users, and service person¬ 
nel to isolate system problems in the 
event of a failure. 

Flexibility. We designed Open Boot 
for adaptability. Its notation and struc¬ 
ture for naming particular devices is 
based on a hierarchical device tree that 
mimics the bus configuration and 
physical addressing of the machine on 
which it is implemented. This structure 
applies equally well to simple, single¬ 
bus desktop machines and to back¬ 
room servers with multiple processors 
and complicated hierarchies of inter¬ 
connected buses. We designed the 
name space for individual device 
names so that allocating names requires 
no central authority. Companies can 
design their products without appeal¬ 
ing to a master name arbiter. 
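
A hierarchical device tree of this kind can be sketched as nested nodes addressed by slash-separated paths. The node names, unit addresses, and properties below are made up for illustration and do not describe any real machine. Prefixing a device name with a vendor string is shown only as one plausible way to avoid a central name registry; it is an assumption of this sketch.

```python
# Minimal hierarchical device tree. Node names, addresses, and properties
# are hypothetical examples, not taken from any real machine.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    properties: dict = field(default_factory=dict)
    children: dict = field(default_factory=dict)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it so calls can be chained."""
        self.children[child.name] = child
        return child

    def find(self, path: str) -> "Node":
        """Resolve a slash-separated path such as '/bus@0/acme,disk@3'."""
        node = self
        for part in filter(None, path.split("/")):
            node = node.children[part]  # KeyError if the device is absent
        return node


root = Node("/")
bus = root.add(Node("bus@0", {"device_type": "expansion-bus"}))
# A vendor-prefixed name (assumed convention) keeps names unique
# without a master name arbiter.
bus.add(Node("acme,disk@3", {"model": "example-disk", "revision": 2}))

disk = root.find("/bus@0/acme,disk@3")
```

The path mirrors the bus hierarchy, so a deeper machine with bridged buses would simply have longer paths rather than a different naming scheme.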

The Open Boot command language 
is open-ended. In addition to the stan¬ 
dard commands that are present on all 
implementations, an arbitrary number 
of new commands may be added at 
any time, even by the user. Such addi¬ 
tional commands may provide access 
to system-specific features or may sim¬ 
ply be customizations for the needs and 
tastes of individual users. 
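
The open-ended quality described here is characteristic of Forth-style interpreters, where a new command is defined in terms of existing ones. The toy interpreter below is a sketch under that assumption; its three built-in words and its definition syntax are illustrative only and are not the Open Boot command set.

```python
# Toy Forth-like command interpreter with user-defined words.
# The built-in word set here is hypothetical, not the Open Boot command set.
def make_interpreter():
    stack = []
    words = {
        "+": lambda: stack.append(stack.pop() + stack.pop()),
        "*": lambda: stack.append(stack.pop() * stack.pop()),
        "dup": lambda: stack.append(stack[-1]),
    }

    def execute(tokens):
        for tok in tokens:
            if tok in words:
                words[tok]()
            else:
                stack.append(int(tok))  # anything else must be a number

    def interpret(line: str):
        tokens = line.split()
        if tokens and tokens[0] == ":":
            # ": name body ... ;" defines a new word from existing ones
            name, body = tokens[1], tokens[2:-1]
            words[name] = lambda: execute(body)
        else:
            execute(tokens)
        return list(stack)  # snapshot of the data stack after this line

    return interpret


interpret = make_interpreter()
interpret(": square dup * ;")   # the user adds a new command
result = interpret("7 square")  # and uses it like a built-in one
```

Once defined, the new word is indistinguishable from a built-in one, which is what lets vendors and users extend the command set without changing the interpreter itself.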

Maintainability. Field ROM up¬ 
grades can be expensive. Open Boot 
provides a self-patching facility that 
allows many types of firmware bugs 
to be fixed without changing the system
ROM. The same facility can be used
to add additional firmware capabilities 
to systems in the field, without chang¬ 
ing the ROMs. 

Towards standardization 

Sun Microsystems began developing 
Open Boot in 1987. We introduced 
version 1 with our Sparcstation 1 ma¬ 
chines. Version 2, introduced with 
Sparcstation 2, corrected a number of 
deficiencies (we learned from our mis¬ 
takes) and added several new features. 
We now use Open Boot on all of our 
current machines; approximately 
500,000 units in the field employ it. 

Sun has licensed its Open Boot 
implementation to several other com¬ 
puter manufacturers. Force Computers, 
a leading supplier of VMEbus products, 
intends to use Open Boot on future 
products across several processor fami¬ 
lies and buses. Sun Microsystems of¬ 
fers the source code for an 
implementation of Open Boot under a 
licensing arrangement. 

Working group. The IEEE P1275
Open Boot Working Group is devel¬ 
oping a firmware standard based on 
Open Boot. The group meets monthly 
at various locations. Membership is 
open to all interested parties. For more 
information, contact the author (see 
box). 

Several IEEE bus standards are mak¬ 
ing provisions for Open Boot. The 
Futurebus+ standard includes a mechanism
for using FCode ROMs for self-identification.
The VME-D draft document
mentions a similar mechanism.
SBus, under consideration for IEEE 
standardization, uses FCode. 

For a copy of the P1275 draft specification,
contact the working group. The
specification may be used and imple¬ 
mented without licenses or royalties. 

Mitch Bradley is a senior staff engi¬ 
neer at Sun Microsystems. He has de¬ 
signed analog and digital hardware, 
written Unix device drivers and other 
software, and been a system trouble¬ 
shooter. Most recently he worked on 
the design, implementation, and pro¬ 
motion of the Open Boot firmware. He 
owns Bradley Forthware, a small com¬ 
pany specializing in Forth implemen¬ 
tations for various computers. 

Bradley earned a BE in electrical en¬ 
gineering, computer science, and math 
from Vanderbilt University and an MS 
in electrical engineering from Stanford 
University. He also studied for a year 
at Cambridge University on a Churchill 
Scholarship. He is a member of the 
IEEE Computer Society, the Forth In¬ 
terest Group, and the ANSI Forth Stan¬ 
dards Team. 

The limits of chip density

Ware Myers, Contributing Editor 

Semiconductor density promises to 
increase for many decades to come, 
James D. Meindl of Rensselaer Poly¬ 
technic Institute predicted in an invited 
lecture to Supercomputing 91 in Albuquerque
last November (J.D. Meindl,
“Gigascale Integration (GSI) Technol¬ 
ogy,” Proc. Supercomputing 91, IEEE 
Computer Society Press, Los Alamitos, 
Calif., pp. 534-538). His analysis points 
to what he calls “gigascale integration,” 
or one billion transistors on a chip, by 
about the year 2000 with continued 
progress for some 30 years into the new 
century. 

Meindl analyzed a hierarchy of lim¬ 
its on transistor density: fundamental 
limits set by the laws of physics; mate¬ 
rial limits set by the physical structure 
and chemical composition of the likely 
materials, silicon and gallium arsenide; 
device limits such as, in the case of sili¬ 
con MOSFET devices, channel length, 
gate oxide thickness, and others; cir¬ 
cuit limits such as the constraints on 
switching logic circuits; and system lim¬ 
its such as clock skew. 

In addition, he considered the prac¬ 
tical limits influenced by manufactur¬ 
ing technology and economics. Meindl 
asked, “How many components/tran¬ 
sistors can we expect to fabricate in a 
single silicon chip that will prove to be 
useful, i.e., be economically viable, at 
some designated future time?” He quan¬ 
tified the practical limits in terms of 
three macrovariables: minimum feature 
size, the square root of the die area, 


0272-1732/92/0200-0075$03.00 © 1992 IEEE


Send information for inclusion in Micro News 
one month before cover date to Managing 
Editor, IEEE Micro, PO Box 3014, Los Alamitos,
CA 90720-1264.


and the packing efficiency, that is, the 
number of transistors or components 
per minimum feature area. 
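
Taken at their stated definitions, the three macrovariables multiply out to a transistor count: a die whose edge is D (the square root of the die area) holds (D/F)^2 minimum-feature areas of size F^2, and the packing efficiency PE gives the transistors in each one. The Python sketch below shows that arithmetic only; the numeric values are illustrative assumptions, not figures from Meindl's lecture.

```python
def transistors_per_chip(feature_size_m, die_edge_m, packing_efficiency):
    """N = PE * (D / F)**2, following the definitions in the text.

    feature_size_m      -- minimum feature size F, in meters
    die_edge_m          -- square root of the die area D, in meters
    packing_efficiency  -- transistors per minimum feature area (F**2)
    """
    feature_areas_on_die = (die_edge_m / feature_size_m) ** 2
    return packing_efficiency * feature_areas_on_die


# Hypothetical values chosen only for illustration:
# F = 0.25 micron, D = 25 mm, PE = 0.1 transistor per feature area.
n = transistors_per_chip(0.25e-6, 25e-3, 0.1)
print(f"{n:.2e} transistors")  # about one billion
```

With these assumed inputs the count lands near the one-billion mark, which is the sense of "gigascale" used above.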

From principles of physics such as 
the Heisenberg uncertainty principle, 
Meindl derived two fundamental lim¬ 
its which he plotted on a log-log 
power-time delay field. One side of 
these curves constitutes an impossible 
region, an area where the power or 
the time delay is less than that required 
to perform a switching operation. 

In a similar manner he derived vari¬ 
ous other limits. He plotted limits on 
switching operations on the power¬ 
time delay field. Limits on transmission 
operations were plotted on a field of 
interconnection path length versus the 
corresponding signal propagation or 
response time. These lines gradually 
boxed in areas on the fields where op¬ 
erations are feasible. 

Along the way he discovered that 
“on the basis of bulk material limits 
per se, silicon is not inferior to gallium 
arsenide as a material for gigascale in¬ 
tegration.” That is, as switching circuits 
get closer to the limits he hypothesizes, 
silicon will reassert itself over gallium 
arsenide’s present electron-mobility 
advantage. 

After considering a number of 
metrics and limits, Meindl felt that one 
figure of merit, in particular, was sin¬ 
gularly instructive. The chip perfor¬ 
mance index, as he calls it, is the 
number of transistors on a chip, divided
by the power-delay product. By
power, he means the average power 
consumption during a binary switch¬ 
ing transition of a logic gate. By delay, 


February 1992 75 










Micro News 


he refers to the corresponding transi¬ 
tion (or delay) time. Thus, as the num¬ 
ber of transistors increases, the chip 
performance index also increases. And, 
as the power and/or delay time (in the 
denominator of the equation) becomes 
smaller, the chip performance index 
further increases. 

This index increased by a factor of about 10^13
from 1960 through 1990, Meindl calculated.
He projects it to increase by
another factor of 10^6 from 1990 through
2020.
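
The index and the two growth figures can be checked with a few lines of Python. This is a minimal sketch of the definition as reported here (transistor count divided by the power-delay product); the function names are ours, and the per-year factors are simply what the quoted 30-year totals imply if growth were steady.

```python
def chip_performance_index(transistors, power_w, delay_s):
    """Number of transistors divided by the power-delay product."""
    return transistors / (power_w * delay_s)


def annual_growth_factor(total_factor, years):
    """Steady yearly multiplier that compounds to total_factor."""
    return total_factor ** (1.0 / years)


# Implied yearly growth, assuming steady compounding over each period.
past = annual_growth_factor(1e13, 30)    # 1960 through 1990
future = annual_growth_factor(1e6, 30)   # 1990 through 2020
print(f"1960-1990: x{past:.2f} per year")
print(f"1990-2020: x{future:.2f} per year")
```

Under that assumption the index grew by roughly 2.7 times a year in the first period and would grow by roughly 1.6 times a year in the second.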

Tiny lasers 

Researchers at AT&T Bell Laborato¬ 
ries have made what they believe is 
the world’s smallest semiconductor la¬ 
ser: about 5 microns in diameter. Seen 
through a scanning electron micro¬ 
scope, the lasers look like microscopic 
thumbtacks with the head of each tack 
400 atoms thick. 

The lasers operate in what is called 
a “whispering gallery” mode, so named 
after the sound effect noted in large 
cathedrals, where a whisper along the 
wall can be heard all along the inside 
perimeter. Like these whispers, pho¬ 
tons travel with low losses around the 
edge of the laser. The lasers can be 
used as surface- or side-emitting de¬ 
vices. Each semiconductor disk laser 
is made of one or more layers of in¬ 
dium gallium arsenide sandwiched 
between two layers of indium gallium 
arsenide phosphide. 



AT&T's micro disk laser 


Nanotechnology 

As technology becomes feasible on 
smaller and smaller scales, scientists 
continue to explore organic and inorganic
paths in search of building blocks
for what may one day be computers 
that function at the molecular or atomic 
levels. 

Natural nanocircuits. One bio¬ 
physicist is studying proteins that con¬ 
duct one-electron currents to find a 
model for protein-based nanocircuits— 
circuits made of organic materials 1,000 
times smaller than those used today. 

Jose Nelson Onuchic, an assistant 
professor of physics at the University 
of California, San Diego, believes it may 
be possible to design synthetic proteins 
that modulate minute currents, based 
on those found in vision, photosynthe¬ 
sis, and other biological functions. 
Onuchic believes an advantage to this 
line of study, as opposed to the tradi¬ 
tional study of inorganic materials, is 
that nature has already evolved spe¬ 
cial proteins to perform the tasks com¬ 
puter designers are trying to emulate. 

The goal of his research is the 
biochip, a molecular electron device
that would include the protein or other 
organic chemical equivalents not only 
of wires but also of other computational 
gear, such as junctions, switches, re¬ 
sistors, and amplifiers. 

Towards that goal, Onuchic and his 
team are pursuing three lines of re¬ 
search. The first is biochemical, in 
which researchers study the features 
of natural proteins that control the rate 


of electron tunneling within and be¬ 
tween the amino acids that form the 
links of protein chains. The electron 
transfer rate depends on the nature of 
the chemical bonds traversed by the 
tunneling electrons as they jump from 
atom to atom and on the geometry of
the protein. Therefore, a second line 
of research explores the rules that gov¬ 
ern the folding of natural proteins into 
their effective shapes. 

The third line of research, which 
builds on the results of the other two 
lines, is directed toward making a work¬ 
ing molecular electronic device. These 
efforts aim at creating a one-molecule 
memory element called a shift regis¬ 
ter, along with a one-molecule ampli¬ 
fier to read the information stored in 
the shift register. Onuchic believes it 
might also be possible for proteins used 
as memory and computing elements 
to self-assemble on biochip surfaces, 
as proteins do in living cells. 

Molecular manufacturing. Alter¬ 
natively, some researchers are work¬ 
ing towards molecular manufacturing, 
the ability to build nanomachines one 
molecule or atom at a time. In contrast 
to contemporary microcircuits, in which 
billions of electrons flow from one 
place to another, nanometer-scale cir¬ 
cuits might perform the same opera¬ 
tion with the change in shape or 
position of a single molecule. 


Micro bits 


Amador Corporation, a Minnesota- 
based, conformity assessment testing 
laboratory, has announced a coop¬ 
erative arrangement with the German 
VDE Testing and Certification Insti¬ 
tute to test US information tech¬ 
nology equipment for export to 
Germany. 

IBM and the Center for Advanced 
Research in Biotechnology are de¬ 
veloping a portable software system 

for computational structural biol¬ 
ogy, including programs to compute 


protein structural data and model 
protein molecules. 

Symbmath, an expert system that
solves mathematical problems in symbolic
formulas or through numeric
computation, is available as shareware
under the file name SM13A.ZIP
from the directory MSDOS.Calculator 
in Simtel 20 at File Transfer Protocol 
sites. Symbmath requires significantly 
less RAM than most comparable soft¬ 
ware—640 Kbytes, as opposed to as 
much as 4 Mbytes. 













To build such small devices, scien¬ 
tists must learn how to manipulate in¬ 
dividual molecules or atoms. Current 
methods of molecular manipulation 
rely on the random motions of atoms 
to form the desired compounds. But 
molecular nanotechnology aims at 
moving individual atoms or molecules 
and snapping them into place. 

Some progress has already been 
made. Scanning probe microscopes 
drag an ultra-fine stylus over a micro¬ 
scopic surface, thus mapping out its 
shape and allowing scientists to see in¬ 
dividual atoms. Researchers have found 
that they can sometimes get atoms to 
stick to the microscope tip, enabling 
them to move the atoms around. 

Using this technique, IBM research¬ 
ers in San Jose in 1989 positioned 35 
atoms of xenon to spell out their cor¬ 
porate logo. The achievement prompt¬ 
ed others to create their own nanoart: 
scientists at Hitachi’s Central Research 
Laboratory in Japan spelled out “Peace 
’91” by removing sulfur atoms from mo¬ 
lybdenum disulfide, and a Stanford 
University student inscribed the first 
page of A Tale of Two Cities on a sur¬ 
face about the size of a red blood cell. 

K. Eric Drexler, a visiting scholar at 
Stanford, is one of the founders of the 
Foresight Institute and the Institute of 
Molecular Manufacturing. He argues 
that manufacture at the molecular level 
is feasible and will one day produce 
nanomachines controlled by submi¬ 
cron, 1,000-MIPS CPUs and powered 
by nanomotors. Among the uses of 
such devices could be swimming 
through blood vessels to find and de¬ 
stroy viruses or bonding chlorine at¬ 
oms from the atmosphere to protect 
the ozone layer. The machines would 
be manufactured by other nanoma¬ 
chines called molecular assemblers, tiny 
robot arms that position molecules pre¬ 
cisely, bonding them into place in the 
design. (See related story on p. 7) 

Alternative methods rely on the ten¬ 
dency of certain molecular components 
to stick to one another in the design of 
structures, thus enabling self-assembling
or self-replicating devices. Nadrian
Seeman, a chemistry professor at New 
York University, has created a design 
that allowed DNA strands to assemble 
themselves into a cubelike structure. 
Julius Rebek, Jr., a chemist at Massachusetts
Institute of Technology, designed
a synthetic molecule that serves
as a template for two other molecules 
that combine to form a copy of the 
original. 

1.2-ps photodetectors 

A graduate student at the University 
of Michigan has used low-temperature- 
grown gallium arsenide to produce 
what she says may be the fastest light¬ 
detecting microchip. The device’s 1.2- 
ps response time is six times faster than 
commercial photodetectors. 

The chip’s developer, Yi Chen, cred¬ 
its the special type of gallium arsenide, 
developed by researchers at MIT, for 
its speed. The material responds very 
quickly to light, creating a current that 
stops and starts instantly as a laser pulse 
starts and stops. 



Chen's interdigitated electrodes 


Chen’s biggest technical challenge 
was developing an electrode structure 
to match the sensitivity of the gallium 
arsenide. She developed a way to use 
submicron electron-beam lithography 
to create interdigitated electrodes about 
0.2 microns or 2,000 angstroms apart. 
By condensing the photodetector’s 
electrodes into a smaller area, the de¬ 
sign prevents the loss of signal-carry¬ 
ing electrons in the gaps between 
electrodes. The detector achieves an 
internal quantum efficiency of 68 per¬ 
cent (collecting 68 out of every 100 


electrons) as opposed to traditional 
electrode structures that achieve about
1 percent.

According to Chen, the device holds 
promise for fiber-optic communication 
networks hundreds of times faster than 
current systems, precision 3D inspec¬ 
tion systems, vehicle collision avoid¬ 
ance systems, and advanced medical 
imaging techniques. Chen is a gradu¬ 
ate student in the school’s applied phys¬ 
ics program and a post-doctoral fellow 
at AT&T Bell Laboratories. 

Cobol coinventor dies 

Rear Admiral Grace Murray Hopper, 
known as “the first lady of software,” 
died at her home in Arlington, Virginia, 
on January 4. She was 85. 

Hopper was a pioneer computer 
programmer for the US Navy and 
coinventor of Cobol. She is credited 
with coining the term bug to describe 
the problems that plague computers 
and programs. 

Admirers described Hopper as a vig¬ 
orous, tireless, and occasionally con¬ 
trary woman, with a healthy contempt 
for those unwilling to try new ideas. 
She once said, “The only phrase I've 
ever disliked is, ‘We’ve always done it 
that way.’ ” 

Hopper earned a PhD in mathemat¬ 
ics from Yale University and taught at 
Vassar College before joining the Naval
Reserve in 1943. After World War
II she remained in the Naval Reserve
and joined the company building the 
Univac I. The company later merged 
into Sperry, where she developed an 
idea that led to Cobol. She was the
US’s oldest active duty military officer 
when she retired in 1986. 

Windows software

Speech from Windows 

Using synthesized speech, Monologue for
Windows reads aloud text, numbers, and data 
from Windows 3.0 and its applications. As in the 
manufacturer’s original version, Monologue for 
Windows converts digital information into simu¬ 
lated speech in two phases using a set of pho¬ 
netic translation and pronunciation rules. The 
software analyzes and translates text into sound 
descriptors, a phonetic language containing more 
than 1,000 rules to incorporate pitch, duration, 
and amplitude codes. Then it converts the lan¬ 
guage into speech signals. Algorithms drive 
speech tables that incorporate rules for merging
signals into continuous speech. 

Users can adjust the speed, pitch, and volume 
on screen. A dictionary manager lets users in¬ 
struct the software how to pronounce words, 
abbreviations, acronyms, and symbols that may 
not comply with phonetic rules. The software 
requires 2 Mbytes of RAM, MS/PC-DOS 3.1 or 
higher, Windows 3.0 or higher, and a hard disk 
with at least 2 Mbytes available. First Byte; $149. 

Reader Service No. 10 

Video window 

X.TV is a video window that users can ma¬ 
nipulate like any other window on workstations 
running X Windows. Users can reposition X.TV
anywhere on the screen, scale it to full screen, 
or reduce it to icon size. An on-screen control 
panel accesses audio-video functions, including 
volume, brightness, and contrast. A software 
push-button activates the frame grabber. Images 
can be named and saved to a disk file in raw 
data, TIFF, or Targa formats. 

X.TV receives images through VMEbus, SCSI, 
or RS-232 port buses from the manufacturer’s 
RGB/View hardware. RGB/View processes im¬ 


ages independently from the CPU and receives 
from a variety of sources including video cam¬ 
eras, scanning electron microscopes, and infra¬ 
red devices. RGB; $750 (X.TV), from $8,995 
(RGB/View). 


Reader Service No. 11 



RGB's X.TV video window 


VMEbus development kit 

Programmers can develop Microsoft Windows 
3.0 applications in a VMEbus environment us¬ 
ing the XVME-984 Windows VMEbus Toolkit. 
The kit is a utilities package designed to run 
with the manufacturer’s line of PC/AT VMEbus 
processors and support modules. It includes 
VMEbus Manager, a Windows 3.0 application 
that accesses and monitors the VMEbus. A li¬ 
brary of functions contains the low-level code 
that configures and communicates with the 
manufacturer’s VME I/O products. Demonstra¬ 
tion programs are also included that offer ex¬ 
amples of library uses. Xycom; $650 (with 
Windows), $500 (toolkit only). 

Reader Service No. 12 

PC-mainframe software 

The Extra PC-to-mainframe software for Mi- 

Texas Instruments says its
TMS320C40 is the first digital signal 
processor built specifically for parallel 
processing. One of the biggest chal¬ 
lenges facing the chip’s design team 
was how to debug a chain or array of 
multiple processors. Chief architect 
and program manager Ray Simar had 
seen customers purchase individual 
debuggers for each processor in a 
parallel system. 

“Those debuggers cost $10,000 to 
$20,000 each,” Simar said. "This gets 
prohibitive if you have 16 processors.” 

Space becomes a big problem, too. 
“They’d have nowhere to put them,” 
he added, “so they’d have them hang¬ 
ing from the ceilings.” 

TI avoids rooms full of hanging 
debugger boxes by incorporating an 
analysis module onto each C40 chip. 
The module links through a JTAG 
(IEEE Std. 1149.1) port to debugger 
software, accessible in a windows- 
format interface. Users can monitor 
and test processors in the system indi¬ 
vidually or globally. 

The analysis module is one of the 
key features of the C40, which Simar 
says is the first DSP built specifically 
for parallel processing. His team de¬ 
signed the chip from the ground up 
after looking at customers' difficulties 
incorporating their previous-genera¬ 
tion DSP, the C30, into parallel pro¬ 
cessing systems. 

According to Simar, customers 
linked the C30s together with multi- 


Parallel-processing DSP 



Texas Instruments' TMS320C40 DSP


chip FIFO memories, an expensive pro¬ 
cess that also takes up space. “What 
we’ve done, essentially, is integrate 12 
FIFOs onto the chip,” he said. The C40 
has six communication ports that link di¬ 
rectly to other C40s with no external 
logic required. Thus, they link together 
tightly in pipelines, 2D arrays, or 3D ar¬ 
rays. This close proximity helps the C40 
achieve what TI says is the highest per¬ 
formance of any floating-point proces¬ 
sor: 275 MOPs with 320-Mbyte/s 
throughput per C40. Simar says the chips 
achieve even higher performance in par¬ 
allel processing systems. 

An on-chip, six-channel DMA copro¬ 
cessor acts as a clerk for the on-chip 


CPU. To allow the CPU to function 
smoothly, the DMA receives data, as¬ 
sembles it into packets, and stores it 
in a memory buffer (if necessary) un¬ 
til the CPU is ready for it. The C40 has 
an 8-Kbyte memory and two external 
memory buses to global and local 
memory. Through the interface, users 
choose how much on-chip memory 
space is allocated for the DMA and 
how much for the CPU. 

Among the tools available from the 
manufacturer are an in-circuit emula¬ 
tor that provides parallel debug capa¬ 
bilities for embedded applications, a 
parallel-processing development sys¬ 
tem, an ANSI-compatible C compiler 
with parallel-processing runtime sup¬ 
port library, an assembler and linker, 
and a state-accurate simulator. Third- 
party tools include the Multiprox 
code generation systems, an optimiz¬ 
ing Ada compiler, and the Spox real¬ 
time operating system. 

Simar says the C40 is suited for 
medical imaging (ultrasound, mag¬ 
netic resonance, and CAT scans), pat¬ 
tern recognition for robotics, radar 
processing, 3D graphics, and military 
uses (image recognition and track¬ 
ing). He also sees prospects for some 
nontraditional uses, including high- 
performance accelerators that attach 
directly to personal computers or 
workstations, high-bandwidth data 
transmission, and high-speed net¬ 
work interface. TI plans to ship in vol¬ 
ume by mid 1992. Texas Instruments. 

Reader Service No. 13 


crosoft Windows includes several easy- 
to-use features. Dynamic data exchange 
macros create automated information 
links between the mainframe and Win¬ 
dows applications. APL character sup¬ 
port assists in financial and statistical 
analysis. 

Extra also supports light pens, a fea¬ 
ture designed primarily for the health 


care industry. Command-line file trans¬ 
fer capability, for users more comfort¬ 
able with DOS syntax, allows 
concurrent transfer of files using mac¬ 
ros. A diagnostic trace facility allows 
users to record communications events 
on the network, to identify problems. 
Attachmate; $425, $75 (upgrades).

Reader Service No. 14 


Cut and paste from X to MS 

An X server integrates the X Window 
System in a Microsoft Windows envi¬ 
ronment. Xoftware for Windows allows 
users to cut and paste between X and 
MS Windows applications and use MS 
Windows as a local window manager. 
It also features a complete on-line help 
system and several conveniences for 


installing, configuring, and starting X 
applications from within the MS Win¬ 
dows environment. 

Xoftware for PC Unix, release 2.1, is 
an X server for 386/486-based machines 
running SCO or Interactive Unix. It adds 
support for Texas Instruments’ 34020-
based graphics accelerator board and a
hot-key function that allows users to switch
to SCO's Multiscreen application. Age;
$495 (Xoftware for Windows), $595 
(Xoftware for PC Unix). 

Reader Service No. 15 

DOS-to-Unix connectivity 

Aterm software allows two-way file 
transfer between systems and can emu¬ 
late ANSI color terminals, graphics sup¬ 
port, multiple screen display, and a 
menu-driven interface. It features a hot¬ 
key function that allows seamless 
movement between DOS and Unix sys¬ 
tems. PC users can run DOS applica¬ 
tions and switch immediately to 
applications running on the Unix host. 
Aterm is compatible with MS-DOS and 
Windows. Specialix; from $125. 

Reader Service No. 16 


Signal-processing 
hardware and software 

Background data collection 

Easy Data collects data, modifies it, 
and sends it to the keyboard buffer 
without affecting normal keyboard 
functions. It imports data in the back¬ 
ground while users work with another 
program in the foreground. Easy Data 
works with most software that receives 
data entry through the keyboard. Key¬ 
board characters and macros can be in¬ 
serted automatically before and after 
each data field to simulate the same 
keys that would be pressed in manual 
data entry. Data can also be selectively 
parsed so that only the required data 
transfers. Labtronics; $145. 

Reader Service No. 17 

1,280-Mflops performance 

Two of Texas Instruments’ C40 DSP 
chips (see box on p. 79 and diagram 
below) are put to use in the Spirit-40 
AT, an 80-Mflops DSP engine. Each 40- 
Mflops C40 includes 1 Mbyte of SRAM 
on the main bus, 1 Mbyte of SRAM on 


the local bus, 64 Kbytes of boot 
EPROM, and memory expansion for up 
to 16 Mbytes of DRAM. 

With the C40’s six communication 
links, the Spirit-40 configures up to a 
six-dimensional hypercube engine with 
64 nodes. The board occupies a 16-bit 
ISA slot and is compatible with 386- 
and 486-based machines. Up to eight 
boards fit in a passive backplane ISA 
system, yielding 640-Mflops perfor¬ 
mance. Two backplanes yield 1,280- 
Mflops peak performance. Sonitech 
International; $8,995. 
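
The node count and the performance totals above follow from simple arithmetic, sketched here in Python. In a six-dimensional hypercube every node has six neighbors, one per address bit, which matches the C40's six communication links. The function name and the node-numbering scheme are illustrative assumptions, not Sonitech's software.

```python
def hypercube_neighbors(node, dimensions):
    """Nodes whose binary address differs from `node` in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


DIMENSIONS = 6                      # one dimension per C40 communication link
NODE_COUNT = 1 << DIMENSIONS        # 2**6 nodes

MFLOPS_PER_C40 = 40
C40S_PER_BOARD = 2
BOARDS_PER_BACKPLANE = 8

board_mflops = MFLOPS_PER_C40 * C40S_PER_BOARD
backplane_mflops = board_mflops * BOARDS_PER_BACKPLANE

print(NODE_COUNT)                          # 64
print(hypercube_neighbors(0, DIMENSIONS))  # [1, 2, 4, 8, 16, 32]
print(board_mflops, backplane_mflops, 2 * backplane_mflops)  # 80 640 1280
```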

Reader Service No. 18 

DSP package 

Filter designers can analyze the performance
of their designs on Monarch,
a menu-driven DSP package with filter 
design, signal analysis, and graphical 
capabilities. Users can design finite im¬ 
pulse response filters (up to a maxi¬ 
mum order of 512) or infinite impulse 
response filters. 
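
For readers unfamiliar with the term, a finite impulse response filter computes each output sample as a weighted sum of the most recent input samples; a filter of order M uses M + 1 weights. The Python sketch below is a generic direct-form illustration under that textbook definition. It is not Monarch or Siglab syntax, and the four-tap moving average is an arbitrary example.

```python
def fir_filter(coefficients, samples):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n - k].

    Input samples before the start of the sequence are treated as zero.
    """
    output = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(coefficients):
            if n - k >= 0:
                acc += h * samples[n - k]
        output.append(acc)
    return output


# Order-3 moving average (four equal taps) applied to a unit step.
taps = [0.25, 0.25, 0.25, 0.25]
step = [1.0] * 6
print(fir_filter(taps, step))  # [0.25, 0.5, 0.75, 1.0, 1.0, 1.0]
```

The output ramps up over the first four samples and then settles, which is the smoothing behavior expected of a moving average.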

The package includes Siglab, a DSP 
language with over 100 mathematical 
and system operations for performing 



Sonitech's Spirit-40 AT 

signal and system analysis. Designers
can create their own algorithms and 
DSP operations, which can be synthe¬ 
sized, tested, and saved for future use 
in other applications. Graphical dis¬ 
plays feature overlay, zoom, grid, color, 
line style, and data location support. 
Monarch simulates real-world condi¬ 
tions, such as noise and fixed-point 
analysis. It requires MS-DOS 3.0 or
higher, 640-Kbyte RAM, and a hard disk 
(or 2-Mbyte floppy). Dynacomp; 
$549.95, $10 (demonstration disk). 

Communication
hardware and software 

Controls 255 nodes 

COM20010 is a token-passing com¬ 
munications controller designed for 
high-speed intelligent data highways. 
The 2.5-Mbps device features a 
microcontroller interface and 1K x 8 on-board
buffer RAM. It uses an Arcnet
protocol engine and a token-passing 
protocol for real-time deterministic per¬ 
formance that supports data packet 
sizes from 4 to 512 bytes. 

Up to 255 peer-to-peer nodes can 
network via the COM20010 at a data 
rate of 2.5 Mbps. System designers can 
implement peer-to-peer or master-slave 
communication schemes. The CMOS- 
fabricated device runs on a +5V power 
supply and comes in commercial and 
industrial temperature ranges. Standard 
Microsystems; $11.36 (1,000s). 
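
Token passing is what makes the timing deterministic: only the node holding the token may transmit, so the wait for a turn is bounded. The Python sketch below illustrates the idea, assuming the usual Arcnet convention that the token goes to the next-higher active node ID and wraps from the highest to the lowest. The function name is ours, and the timing figure counts payload bits only, ignoring all protocol overhead.

```python
def next_token_holder(active_ids, current_id):
    """Return the next-higher active node ID, wrapping to the lowest."""
    ring = sorted(set(active_ids))
    for node_id in ring:
        if node_id > current_id:
            return node_id
    return ring[0]


MAX_NODES = 255          # from the text
MAX_PACKET_BYTES = 512   # largest packet size quoted above
DATA_RATE_BPS = 2.5e6    # 2.5 Mbps

# Time to move one maximum-size payload from every node, overhead ignored.
payload_seconds = MAX_NODES * MAX_PACKET_BYTES * 8 / DATA_RATE_BPS

active = [3, 17, 200]    # hypothetical node IDs
print(next_token_holder(active, 3))    # 17
print(next_token_holder(active, 200))  # 3
print(round(payload_seconds, 3))       # 0.418
```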

LANs connect to WANs

Local area networks connect into 
wide area networks with the LAN 
Transport Management System. LAN 
TMS, suited for Token Ring and 
Ethernet environments, eases intercon¬ 
nection between diverse LAN proto¬ 
cols, such as IBM Net BIOS, SNA, TCP/ 
IP, Novell Netware, IBM server, and 
Banyan Vines. Synchronous data link 
control pass-through capability enables 
it to merge a non-LAN serial data stream 
with regular Token Ring traffic for trans¬ 


mission over a common WAN link. 
Network administrators can monitor 
and troubleshoot problems on geo¬ 
graphically dispersed LANs from a cen¬ 
tral management console. General 
Datacomm; from $4,600. 

Gateway software

Up to 30 Netware users can access
transmission control protocol and Internet
protocol applications with Catipult
gateway software. Its TCP/IP applica¬ 
tions run over IPX, using Novell’s Net 
BIOS. From the user’s PC, Catipult’s 
applications are routed to the OS/2 
gateway, which replaces IPX with the 
TCP/IP transport protocols and sends 
the application packets to other hosts. 
The product runs on OS/2 PCs 
equipped with a network interface card 
supported by Novell’s Netware Re¬ 
quester for OS/2 and over Ethernet, 
Token Ring, Arcnet, and Broadband 
networks running Netware 2.x or 
Netware 3.x. Ipswitch; $2,975.

Sparcstations link to IBM

Two software products connect Sun 
Microsystems Sparcstations and IBM’s 
system network architecture (SNA) over 
a Token Ring. Brx TR/SNA transpar¬ 
ently supports all SNA services and sup¬ 
ports local network management. Brx 
TR/IP leverages the transfer functions 
of the Token Ring network to include 
Unix systems. The technology enables 
traditional TCP/IP applications such as 
NFS, File Transfer, Mail, and Remote 
Login to operate seamlessly over To¬ 
ken Ring. Users can access remote 
Sparcstations as if they were locally 
connected to the same network. 
Brixton Systems; $995 (both products 
and Token Ring card). 

Special boards

Video interface card 

The MZ-UT01 interface card estab¬ 
lishes a video rate path between the 


Scorpion real-time VGA frame grabber 
and Eighteen Eight Laboratories’ 
PL2500 floating-point array processors. 
Multiple 640 x 480 x 8-bit images can 
be transferred in real time between the 
video card and the array processor. The 
half-length mezzanine card mounts 
directly on the PL2500 and draws 
power and ground through Span 32 
bus interface connectors. The MZ-UT01
card comes with software to control 
the Scorpion board, interconnection 
cables, and a reference manual. 
Univision Technologies; $2,495. 

Reader Service No. 24 

16.7 million colors 

The Volante family of graphics pro¬ 
cessors offers three alternatives to PC 
users. The AT2000, compatible with PC 
ATs, displays up to 16.7 million colors 
at 640 x 480 resolution. At higher reso¬ 
lution (up to 1,024 x 768) it displays 
32,768 colors. The MC1000, compat¬ 
ible with Micro Channel PCs, displays 
256 colors at 1,024 x 768 resolution. 
The AT800, also compatible with PC 
ATs, displays 256 colors. 
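
The color counts quoted above are powers of two, so each implies a pixel depth: 24 bits for 16.7 million colors, 15 bits for 32,768, and 8 bits for 256. The Python sketch below works through that arithmetic and the resulting frame-buffer size. The storage widths are our assumptions (in particular, that a 15-bit pixel occupies a 16-bit word), not figures from National Design.

```python
def color_count(bits_per_pixel):
    """Number of distinct colors a pixel of this depth can encode."""
    return 2 ** bits_per_pixel


def frame_bytes(width, height, storage_bits_per_pixel):
    """Bytes needed to hold one frame at the given storage width."""
    return width * height * storage_bits_per_pixel // 8


print(color_count(24))  # 16777216
print(color_count(15))  # 32768
print(color_count(8))   # 256

print(frame_bytes(640, 480, 24))    # 921600
print(frame_bytes(1024, 768, 16))   # 1572864 (15-bit color in 16-bit words)
print(frame_bytes(1024, 768, 8))    # 786432
```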

The processors, based on Texas In¬ 
struments’ TMS34020, have a refresh 
rate of 72 Hz and a maximum band¬ 
width of 73 MHz. Their graphical in¬ 
terface works with MS Windows, Tiga, 
X Windows, and Nova Graphics CGI. 
National Design; $1,295 (AT2000), 
$995 (MC1000), $795 (AT800).

National Design's AT2000, MC1000,
AT800


Flicker-free screens 

The Windows VGA 8800 graphics 
board boosts processing speeds for 
Windows 3.0 menus, windows, scroll- 
New Products 


Mips has extended its RISC archi¬ 
tecture to the 64-bit format with the 
R4000 microprocessor, a 1.3-million 
transistor chip designed to handle the 
growing volume of data found not 
only in large processors but in desk¬ 
top machines as well. 

According to Andy Keane, prod¬ 
uct development manager for the 
R4000, company designers looked in 
the direction RISC architecture was 
moving and decided to target the 
volume desktop market. “The early 
focus (of RISC) was on large ma¬ 
chines, especially for databases,” 
Keane said. “Now, RISC is more 
broadly defined.” The R4000 is aimed 
at a range of applications, including 
high-end PCs, graphics workstations, 
database servers, and multiprocess¬ 
ing systems. 

As users grapple with increasingly 
large amounts of data, the current 
generation of 32-bit machines, which 
handles up to 4 Gbytes of informa¬ 
tion, is becoming obsolete. 

Based on observations that the 


64-bit RISC microprocessor 

memory requirements of the average 
program grow by a factor of 1.5 to 2 
each year, the company expects main¬ 
stream applications to exceed the capa¬ 
bility of 32-bit machines by the 
mid-nineties. Doubling the number of 
bits in the microprocessor enables it to 
handle up to a terabyte of data. 
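
The 4-Gbyte ceiling is simply 2^32 bytes, and a terabyte corresponds to 2^40 bytes, which is our reading of the figure quoted above. The Python sketch below shows both numbers and how quickly compound growth of 1.5 to 2 times a year reaches the 32-bit limit. The 512-Mbyte starting size is a hypothetical value chosen only for illustration.

```python
def addressable_bytes(address_bits):
    """Bytes reachable with a byte address of the given width."""
    return 2 ** address_bits


def years_until_exceeded(start_bytes, limit_bytes, growth_per_year):
    """Whole years of compound growth until start_bytes exceeds limit_bytes."""
    if growth_per_year <= 1:
        raise ValueError("growth_per_year must be greater than 1")
    years, size = 0, float(start_bytes)
    while size <= limit_bytes:
        size *= growth_per_year
        years += 1
    return years


limit_32 = addressable_bytes(32)
print(limit_32)                 # 4294967296 bytes, i.e. 4 Gbytes
print(addressable_bytes(40))    # 1099511627776 bytes, i.e. 1 Tbyte

start = 512 * 2 ** 20           # hypothetical 512-Mbyte application
print(years_until_exceeded(start, limit_32, 2.0))   # 4
print(years_until_exceeded(start, limit_32, 1.5))   # 6
```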

Keane said his company is targeting 
PC users who are interested in the per¬ 
formance of a workstation but don’t 
want to give up their investment in soft¬ 
ware. The R4000 is compatible with 
R3000 and R6000 applications, as well as 
Windows NT, Santa Cruz Operation's 
Desktop, and RISC/os, Mips’ implemen¬ 
tation of Unix. The chip performs in a 
32-bit subset mode for most applica¬ 
tions, employing its larger address only 
when necessary. 

A 50-MHz clock drives the chip and 
operates a 100-MHz superpipeline. On- 
chip CPU components include a 64-bit 
integer processor, 64-bit floating-point
coprocessor, memory management unit,
8-Kbyte instruction cache, 8-Kbyte data 
cache, control and management facili¬ 


ties for primary and secondary cache, 
and multiprocessing capabilities. 

The R4000 comes in three versions. 

• R4000PC, which supports primary 
on-chip cache, is a 179-pin grid ar¬ 
ray package aimed at desktop, low- 
end servers, and embedded control 
systems. 

• R4000SC, with secondary cache for 
uniprocessing applications, is 
offered in 447-PGA or land grid ar¬ 
ray packages intended for high- 
performance desktops and servers. 

• R4000MC, with multiprocessing 
features and secondary cache, also 
comes in a 447-PGA or -LGA 
package. 

Based on simulations of the SPEC 
benchmark suite, performance ranges 
from 40 Specmarks for the R4000PC 
to 60 Specmarks for the other two 
versions. According to the company,
comparable Specmark performance
has previously been achieved only 
from implementations of five to nine 
chips. 

Among the development tools 
available are a C RISC compiler and 
a systems programmer package. The 
latter includes a cache memory simu¬ 
lator, an architecture simulator, and 
a development package. 

Through the Advanced Computing 
Environment Initiative, a consortium 
of software, systems, and semicon¬ 
ductor companies that promotes an 
open computing environment, Mips 
has licensed six semiconductor 
manufacturers to produce the R4000. 
Last year Olivetti, Acer, and Mips pre¬ 
viewed PC-size machines using the 
R4000. Mips expects machines using its
microprocessor to be commercially
available in 1992. Mips;
$700 to $1,200, through licensed 
manufacturers. 

ing, and fonts up to 30 times faster than
Super VGA standards, according to the 
manufacturer. A refresh rate of 70 Hz 
reduces screen flicker. Other features 
include 1 Mbyte of memory, 256 col¬ 
ors, and resolution up to 1,280 x 968 
pixels. Genoa Systems; $495. 

Reader Service No. 27 

SBus-to-VMEbus link

Sun Sparcstations can interface with 
a VMEbus system with the Model 467 
adapter. The adapter includes an SBus 
card and a 6U VME card, connected 
with a shielded cable. The systems 
communicate in two ways. Memory 
mapping permits the SBus and VMEbus 
host processor in one chassis to ex¬ 
ecute random access reads and writes 
to the destination bus as it would to a 
local memory. Alternatively, a built-in 
DMA controller permits transfers from
the memory of one system to the other 
as fast as 20 Mbytes/s. Bit 3 Computer 
Corp.; $2,850. 

Bit 3's Model 467 adapter


Storage devices 

Solid-state disk emulator 

The Blue Flame III solid-state, non¬ 
volatile disk emulator boosts access 
speed by up to 20 times. The DOS- 
compatible card is an I/O mapped de¬ 
vice built from 14 SIMMs. It features a 
16-bit data path and transfers data up 
to 4 Mbytes/s. Capacities range from 2 
to 56 Mbytes. Each card fits in a full-length,
16-bit ISA bus slot. Semi Disk
Systems; from $595. 

20-Mbyte Mac floppy

Quad Flextra is a very high density 
floppy-disk subsystem for the Macin¬ 
tosh with a 35-ms seek time and a data 
transfer rate of 1.25 Mbytes/s. Each 3.5- 
in. disk in the subsystem has a format¬ 
ted capacity of 20.4 Mbytes, enough to 
hold font libraries, large graphics, and 
desktop publishing files. The external 
floppy subsystem measures 2.25 x 6.81 
x 8.38-in., weighs less than 4 lbs., and 
has two SCSI connectors. The system 
includes a shielded cable, driver, util¬ 
ity software, and four disks. Quadram; 
$895; $25 (additional disks).

Reader Service No. 30 

4.3-Gbyte tape drive 

Two quarter-inch cartridge tape 
drives store up to 2.15 Gbytes of 
uncompressed data or 4.3 Gbytes of 
compressed data. The 9200 and the 
9200C use a track density of 30 ser¬ 
pentine recording tracks and a record 
density of 67,733 bpi. 

The drives support a synchronous 
burst transfer rate of 4.8 Mbytes/s and 
a sustained host transfer rate of 400 to 
600 Kbytes/s for the 9200 and from 800 
to 1,200 Kbytes/s for the 9200C. According
to the manufacturer, each device
accesses any file on a tape in two
minutes or less and has a nonrecoverable
error rate of no more than
1 in 10^15 bits. Wangtek; $900 (9200),
$1,100 (9200C) (OEM quantities).

Winchesters for portables

Portable, laptop, and notebook com¬ 
puters can support more complex ap¬ 
plications with the higher storage 
capacities of two Winchester disk 
drives. The MK-2024FC has a format¬ 
ted capacity of 86 Mbytes and an aver¬ 
age access time of 19 ms. The 
MK-2124FC holds 130 Mbytes and sup¬ 
ports 17-ms access. Both drives come 
in a 0.75 x 2.5-in., 6-oz. package and
consume 1.8 W when active and
0.15 W when inactive. Each drive has
a 32-Kbyte cache memory and a 5-Mbyte/s
data transfer rate. Toshiba;
$495 (MK-2024FC, in OEM quantities),
$695 (MK-2124FC).

Laptop upgrades

Laptop Solutions upgrades hard disks 
for Toshiba laptops and notebooks, to 
store up to 120 Mbytes of data. Each 
upgrade includes installation and par¬ 
titioning of the hard drive to user speci¬ 
fications, formatting of each logical 
drive, complete system diagnostics, and 
a one-year parts and labor warranty. 
The Houston-based company guaran¬ 
tees a 48-hour turnaround, including a 
24-hour burn-in and test. Laptop Solu¬ 
tions; from $695. 

Product Summary

Joe Hootman 

University of North Dakota 


Chips

Manufacturer: Cirrus Logic
Model: CL-GD6411 graphics controller
R.S.#: 80
Comments: A 3.3V, one-chip VGA controller supports prototyping and system development for notebook computers with 64 gray shades on a monochrome LCD. The device can directly drive a 512-color, active-matrix LCD. Features include simultaneous LCD and CRT displays, host bus interface logic, RAMDAC, and memory control logic. 160-pin quad flat pack; $65 (100s).

Manufacturer: Cybernetic Micro Systems
Model: CY545B/CY500 controllers
R.S.#: 81
Comments: The CMOS CY545B uses pulse and direction signals to generate up to 27,000 steps/s for full-, half-, quad-, and microstep applications. The CY500 provides programmable acceleration slopes for applications requiring 1 step/minute to 2,000 steps/s. Both accept ASCII or binary commands from a serial or parallel host. 44-pin PLCC or 40-pin DIP from $25 (1,000s) (CY545); 40-pin LSI from $10 (1,000s) (CY500).

Manufacturer: Linear Technology
Model: LTC1235 supervisor
R.S.#: 82
Comments: Microprocessor supervisory circuit resets at 1V and offers a conditional battery backup feature for RAM data. Available in commercial or industrial grades and 16-pin SO or plastic DIP. $3.85 (100s) (plastic, commercial).

Manufacturer: LSI Logic
Model: Sparkit-40/SS2 chip set
R.S.#: 83
Comments: Manufacturers developing workstations compatible with the Sun Sparcstation 2 can sample the 40-MHz chip set, which includes the L64841 MMU, L64844 cache controller, and the L64846 DRAM controller. Manufactured for Sun and sold under license, the set works with three graphics controllers and SunSoft software. $844 (100s).

Manufacturer: Philips Semiconductors
Model: 83C528 controller
R.S.#: 84
Comments: This 80C51-compatible microcontroller features a watchdog timer, 32 Kbytes of ROM, and 512 bytes of on-chip RAM. The extra RAM allows the CMOS device to run compiled application programs in PL/M and C languages and provides space for context switching for stack enhancements in internal memory. DIPs, PLCCs, or quad flat packs; EPROM and one-time-programmable versions available.

Manufacturer: Pletronics
Model: Clock oscillators
R.S.#: 85
Comments: Line of 50- to 120-MHz clock oscillators produces HCMOS/ACMOS-compatible output and drives 15 standard TTL loads. The crystal operates in fundamental mode at all frequencies. DIPs and mini DIPs with plastic/J and Gull-Wing leads; from $3 (1,000s).

Manufacturer: Xicor
Model: X24C00 EEPROM
R.S.#: 86
Comments: A 128-bit, CMOS serial device interfaces directly to a 2-wire serial bus and features software protocol that allows operation at a 1-MHz clock rate. 8-pin plastic DIP, type P, and plastic SOIC, type S; from 45 cents (1,000s).

Boards

Manufacturer: Data Translation
Model: DT385 image processors
R.S.#: 87
Comments: One-monitor, PC AT/EISA-compatible boards combine a frame grabber and graphics processor to acquire, digitize, and display standard and nonstandard video signals on a VGA monitor. An on-board TMS34020 processor accelerates Windows 3.0 graphics display; permits scaling and variable screen placement; and speeds arithmetic, convolutions, and morphological operations. From $2,995, depending on memory.

Manufacturer: Micro Express
Model: Hi-Color Turbo VGA card
R.S.#: 88
Comments: Designed around the Tseng Labs graphic chip set and the Sierra Semiconductor 15-bit DAC, the 32,768-color card supports 800 x 600 and 640 x 480 resolutions, as well as 1,024 x 768 resolution with 256 colors. A Turbo switch permits the card to switch between zero- and one-wait-state operation. $185.

Manufacturer: Newer Technology
Model: fx/Overdrive accelerator
R.S.#: 89
Comments: Variable-speed (40-/55-MHz) accelerator, when running at its fastest setting, promises to speed stock Macintosh IIfx performance approximately 40 percent. When teamed with 16-Mbyte SIMMs, the surface mount device lets the IIfx act as a workstation. The accelerator’s motherboard installation leaves the NuBus and PDS slots open.

Manufacturer: Parsytec
Model: BBK-S4 SBus/transputer interface
R.S.#: 90
Comments: Adapter board and accompanying software let Sparcstations and compatibles serve as a standard host to large-scale transputer systems for data transfers up to 8.8 Mbytes/s. The T225 and controller-equipped SBus slave also assists image processing, real-time, and other computation-intensive applications and links with up to four external systems. $3,950; quantity pricing available (10s).

Software

Manufacturer: Microware Systems
Model: Polytron source code control
R.S.#: 91
Comments: Utility package for automatic version control of application source code supports OS-9 and OS-9000 real-time operating systems. The set of 10 integrated modules supports multiuser development environments and features built-in file and user-access security. From $895; available on disk or tape.

Manufacturer: MIPS Computer Systems
Model: R3000 Riscross tools
R.S.#: 92
Comments: RISC microprocessor software development tools run on Sun-4/Sun OS workstations and VAX/VMS systems. The cross-development tools include a K&R-compliant C compiler, System Programmer’s Package, Cache 3000 memory simulator, and SPP/e development package. $40,000 (set); from $9,000 (individual prices).

Manufacturer: Slate Corporation
Model: Penbook
R.S.#: 93
Comments: Electronic book Author and Reader software lets users translate, compress, and store Postscript documents and read them on a pen-based computer. Documents displayed as mixed text/graphics pages can be searched for user-defined words or phrases. A markup layer lets users annotate documents with personal notes. $695 (Author); $99 (Reader).

Peripherals

Manufacturer: HDS
Model: Viewstation FX terminals
R.S.#: 94
Comments: X Window, RISC-based, 14-, 16-, and 19-inch, 256-color terminals feature an Intel i960 CPU and two ASIC processors for communications and graphics integrated onto one board. Local Open Look and Motif window managers reduce network traffic and improve interactive performance. From $2,799.

Manufacturer: Mitsubishi
Model: P-78U video printer
R.S.#: 95
Comments: Monochrome autoscanning video printer promises to deliver 6x8-inch, A4-size prints with 1,280 horizontal dots per line in 256 shades of gray in 24 seconds. Composite, S-VHS, analog, TTL RGB, and Centronics parallel port input print in positive/negative, mirror-image, and multiformat print styles. $2,999.

Manufacturer: RGB Spectrum
Model: 1600U scan converter
R.S.#: 96
Comments: Video scan converter with zoom, antialiasing, and 24-bit color processing changes high-resolution computer graphics to television format in real time. The unit automatically synchronizes to computer displays with 20/90-kHz horizontal scan rates. An optional RS-232 port lets users control all functions from a computer or ASCII terminal.

Miscellaneous

Manufacturer: Aristo Computers
Model: Simcheck adapters
R.S.#: 97
Comments: Memory tester family adds four adapters (ZIP memory chips; PLCC and SOJ memory chips; bank; and Apple Macintosh IIfx SIMM). Basic Simcheck tests standard SIMM and SIP memory modules with 8 or 9 bits of 64K to 16 Mbytes. A two-line LCD shows instructions and test results. $99 to $345 (adapters), $995 (Simcheck).

Manufacturer: Lucas Duralith
Model: LDC100 controller
R.S.#: 98
Comments: Controller with internal EEPROM, 8-bit A/D converter, and serial interface lets designers test the applicability of touch-screen technology to their products by touching the screen at two opposite corners of the active display area. The controller is part of a development kit that includes a touch screen, two product development software disks, a user’s guide, 220V to 110V transformer and plugs, and appropriate cables and connectors. $695.

Manufacturer: Philips Semiconductors
Model: KM110BH/1x sensors
R.S.#: 99
Comments: Hybrid magnetoresistive modules measure rotational speed down to zero using a toothed tachometer wheel with an inductive or Hall-effect sensor. Separations between tooth and sensor can be several millimeters, and tooth structure does not have to be defined closely. Samples available.

Manufacturer: Solectek
Model: Pocket fax modem
R.S.#: 100
Comments: Portable, 9,600/4,800-bps send/receive modem supports DOS, Windows, and Macintosh applications. Users send faxes from within applications via a pop-up fax menu; they may continue working in the foreground or leave the application by entering a telephone number and normal printing commands. From $299.95, depending on application.

Software Report


continued from p. 9 

Micromachines will make it possible 
to create new medical equipment ca¬ 
pable of performing diagnosis and 
treatment simply, accurately, and with 
less need for surgery. They will also 
contribute to the advancement of 
microsurgery techniques (eye surgery, 
suture of microscopic blood vessels, 
and so on) and the development of arti¬ 
ficial organs to be placed inside the 
body. Biotechnologists will apply 
micromachines to cellular manipulation 
such as cell separation, injections into
cells, and cell fusion. 

Announcements 

NSF sponsors an active program for 
US scientists interested in working in 
Japan. As part of that program NSF pre¬ 
pared a directory of 150 Japanese com¬ 
panies that are willing to receive 
American researchers at their laborato¬ 
ries. The directory lists the company 
name, activity information, personnel, 
facilities, and research they will sup¬ 
port, along with contact information. 
For copies of the directory, write to Ja¬ 
pan Programs, Division of International 
Programs, National Science Founda¬ 
tion, Washington, DC 20550; fax (202) 
357-5839. 

The English summary of the outcome
of the preliminary study on NIPT (or 
Sixth Generation Project) that I reported 
on in the April issue has been released. 
The Industrial Electronics Division of 
MITI published the 150-page “Report of 
the Research Committee on New Infor¬ 
mation Processing Technology” in 
March 1991. 


Reader Interest Survey 

Indicate your interest in this department 
by circling the appropriate number on 
the Reader Service Card. 

Low 195 Medium 196 High 197 


The 

Information 

Flood 

Trying to manage the flood of information that passes 
before you can be a frustrating experience. Potentially use¬ 
ful information can be lost when you lack the means to or¬ 
ganize and make sense of the multiple sources that arrive 
daily—as often as not, unbidden. 

IEEE Micro's 

On the Edge... 

... offers a solution to this continuing problem. 

In August, On the Edge begins a two-part tools discussion by James D. 
Gafford. The series will illustrate fairly simple ways for you to make use of 
sophisticated information management tools. The commercially available PC 
tools (MS-DOS or Macintosh) combine ease of use with information manage¬ 
ment power and flexibility. A common theme running through the series will 
be the creation and maintenance of a tool you can use to keep track of the 
information you read in IEEE Micro and other technical publications. 

LOOK FOR THE AUGUST ISSUE 
of IEEE Micro 

It will help you manage the information flood while 
gaining a better grasp of software tools and soft¬ 
ware issues in general. 

Coming

Next Issue 

The April Issue of IEEE Micro features articles se¬ 
lected from presentations at the third annual Hot 
Chips Symposium sponsored by the IEEE Computer 
Society’s Technical Committee on Microprocessors 
and Microcomputers. 

Don’t miss 

Hot Chips III 

Read the April 1992 Issue of 

Hermes, the European space shuttle

Manned space travel has long fueled human
imagination, science fiction books,
and heated debates about its utility and
funding.

Visionary authors like Jules Verne were well 
aware of the latter problem. In his book, From 
the Earth to the Moon (1865), Verne envisioned 
the difficulty of multinational funding for a moon 
projectile. The German confederation was short 
of money and could only contribute 34,285 flor¬ 
ins. For the Vatican, the idea came too soon, just 
after the rehabilitation of Copernicus in 1822. 
Switzerland only granted 257 Swiss francs, since 
the Swiss did not feel they could tie in any trade 
relationship by shooting a cannonball to the 
moon. The English did not participate at all, since 
they considered the project incompatible with 
their principle of nonintervention. France was a 
driving force behind the Gun Club project, but 
the largest part of the funding came from Russia 
and the United States. 

The European program 

The funding of the European manned space 
program in 1992 follows this 1865 scheme. The 
Gun Club’s successor is the ESA, or European 
Space Agency. On the negotiating table are
three major projects involving manned space
flights: the European Hermes space shuttle, the
Columbus manned station, and the Ariane 5 
rocket. The ESA counts 13 members and two 
associates, but the main actors are the economi¬ 
cally stronger countries of France, Germany, and 
Italy. England is absent. Switzerland accounts for 
2 percent of the funding. The unit of negotiation
is not the dollar, but the ECU (European Currency
Unit). One ECU equals approximately US$1.40.

The key component of the project is the
Hermes shuttle, 1 which should carry a crew of
three and a payload of 3 tons to low orbit. Its
development cost is estimated at 6,200 million 
ECUs. Hermes is primarily designed to serve the 
Columbus Man-Tended Free Flyer (MTFF) space 
station scheduled for the year 2001. It could also 
serve the US Freedom space station. The Co¬ 
lumbus attached, pressurized module (scheduled 
for 1998), which is the European contribution to 
the Freedom international program, will dock to 
the Freedom space station. Docking to the Russian
Mir space station is also foreseen. Estimates
of Columbus’s cost are 3,700 million ECUs.
(Freedom’s budget is 14,000 million ECUs, or
US$19,700 million.)

The Columbus precursor program foresees 
different Spacelab flights with the US space 
shuttle. This program involves auxiliary projects 
like the Poem-1 polar orbit station or data relay 
satellites. 

In a clever move, the ESA decided not to de¬ 
velop a special launcher for the Hermes shuttle, 
but to share the Ariane 5 launcher with com¬ 
mercial satellites. Besides reducing development 
costs, this sharing produces two nice side ef¬ 
fects. First, one can build on the experience ac¬ 
cumulated with commercial satellites; second, 
the usefulness of the launcher does not come 
into question. 

In fact, the unmanned European space pro¬ 
gram remains unchallenged. The Arianespace or¬ 
ganization is responsible for the Ariane flights. 
Ariane’s clients include 90 percent of the world’s
satellite operators. Arianespace earned 20 million ECUs
last year and received orders for 5,000 million
ECUs. Ariane’s 100th stage just came off the
assembly line. (It is the 50th rocket.) So, Hermes 
will be able to ride on that wave, although the 
launch capability of Ariane 5 had to be boosted 
to put the 24.4 tons of Hermes in low orbit. The 
Ariane 5 budget also increased to 3,500
million ECUs, but it seems secure now. 

The ESA prepared a long-term plan, 
but the member states want to approve 
the yearly budget. Budget approval pre¬ 
sents the main difficulty for the ESA— 
a similar situation to that of the US 
space station. 2 In November 1991 the 
ESA asked the member states to spend 
the impressive sum of 39,000 million 
ECUs and to make a yes or no deci¬ 
sion on the Hermes program. To de¬ 
crease the yearly budget, the ESA 
proposed to build only one shuttle, 
with the maiden flight postponed to 
the year 2002, one flight per year (in¬ 
stead of two), and operational capa¬ 
bility by 2004. At their November 1991 
meeting in Munich, the ministers pre¬ 
ferred to adjourn the decision instead 
and await further studies. 

The hesitation of the ministers just 
reflects the taxpayer’s mood in the dif¬ 
ferent countries. While 60 percent of 
the French favor manned space flights 
without reserve and consider it a mat¬ 
ter of national pride, they would like 
the others to pay for it also. The Ger¬ 
mans are more concerned with down- 
to-earth themes like environment and 
reunification costs, and the Italians sim¬ 
ply lack the money. 

But the usefulness of manned space 
flights is being questioned again—and 
not without reason. Indeed, the last 30 
years since Yuri Gagarin’s flight have 
shown that humans can contribute little 
in space. The Soviet Progress space¬ 
craft demonstrated that automatic ren¬ 
dezvous in orbit was reliable, and the 
Luna probe brought moon rocks back 
to the earth at a fraction of the Apollo 
project’s costs. Human flights are re¬ 
stricted to low earth orbit, but money 
is made on the geostationary orbit. And 
for astronomical or earth observation, 
nothing is as disturbing as a human 
being moving or sneezing inside the 
spacecraft. 

But for the public, manned space 
travel is tied to emotion, pride, and 
dreams, and it is ready to pay for them. 
Despite the end of the Cold War, many 
see space exploration as a competition,
not unlike the America’s Cup race. And 
this makes manned space travel prob¬ 
ably the most important psychological 
motor to the development of spacecraft. 

In 1992, six Europeans plan to enter 
space aboard foreign spaceships: three 
US shuttle flights (Discovery once and
Atlantis twice) with Spacelab on board 
and two Russian missions to the Mir 
station. The real question is: Does Eu¬ 
rope need its own shuttle when space 
tickets are already on sale? 



And the Russians argue that the Eu¬ 
ropean shuttle already exists: They 
would like to sell their Buran (Bliz¬ 
zard) shuttle as an alternative to 
Hermes. This shuttle already performed 
its automatic maiden flight. Buran’s
next flight is scheduled for 1993, but 
budget restrictions and the political situ¬ 
ation could delay it. 

It is unlikely that the ESA will accept
the Russian offer, because the benefits of
Hermes lie not in operating it but
in building it. The project allows the
European industry to become ac¬ 
quainted with new space technologies. 
It would give work to 16,000 persons 
in the next 10 years. 2 It should form 
links between countries in another 
European project and possibly extend 
them to the Eastern countries. The 
Russians have already offered precursor
flights on their simulators and training
facilities. For industry and
universities, Hermes presents a challenge,
a test bed, and a magnet
for qualified engineers.

Is a space shuttle the correct answer 
to economic space flights? The fact that 
Russia also built a shuttle is no proof 
of its usefulness: In strategic games, one 
covers each move of the adversary. The 
US shuttle failed in at least two aspects. 
It was neither cheaper nor easier to 
operate than expendable rockets. And 
because the US shuttle must be 
manned, one failure delayed the whole 
US space program for two years. 
Hermes and Buran flights do not need 
to be manned, but a failure, even in 
the first unmanned mission, could stop 
the program as well. 

The next generation is already on 
the drawing board: one- or two-stage 
spaceships that use air during atmo¬ 
spheric ascent and switch to rocket 
mode to reach orbit, like the British/ 
Russian Hotol, the German Saenger, or
the French Star-H. But the way to such 
spacecraft comes from mastering the 
technology, and this shall be worth the 
cost of Hermes. The fact that the ESA 
budgeted only one Hermes spaceplane 
shows that it is nothing more than a 
prototype. 

Fault-tolerant 

multiprocessor 

For the computer architects, the most 
challenging part of the Hermes project 
is its on-board computer, called the 
SEF. 3 This fault-tolerant system, devel¬ 
oped by Matra Marconi Space 
(Toulouse, France), will be the core of 
the Hermes avionics and support guid¬ 
ance, communication, and the mission 
itself. 

The fault-tolerant computer system 
consists of a pool of four computers 
interconnected by serial links. It looks 
very similar to other avionics comput¬ 
ers like the US space shuttle Primary 
Computer (1974), the SIFT (1978), or 
the FTMP (1978) computers. The ar¬ 
chitecture has not changed much in 
the last 17 years. Why should it? La
fonction fait l’objet (function makes the object).












The most critical function supported 
by the SEF is the guidance, navigation, 
and control (GNC) of the spaceplane. 
GNC is normally a repetitive task: Read 
the input sensors, process the input 
data, and generate the command to the 
actuators. The GNC computer uses a 
high-performance RISC processor (the 
Sun Microsystems Sparc is a candidate) 
with 2 Mbytes of memory to provide 4 
MIPS of processing power. The boards 
of the GNC computer interconnect via 
a Nubus (IEEE Std. 1196). 

The main input sensors are the iner¬ 
tial navigation system, the global posi¬ 
tioning system, and the radio altimeter. 
A set of sensors connects to each of 
the four GNC computers through a 
dedicated bus. The critical communi¬ 
cation link to the Ariane 5 system is 
duplicated. 

To cover the time-critical flight 
phases, the computer masks errors 
rather than using time-consuming re¬ 
covery methods. To this effect, four 
processors execute the GNC algorithms 
in parallel. The processors are synchro¬ 
nized to operate on the same input data 
set to ensure that they do not diverge. 
Each computer reads its inputs and 
sends their values to the other three 
computers over the 7-Mbps serial 
interprocessor link. The computers 
reach a consensus on the input data, 
process the data, and broadcast the 
result, so as to reach a consensus on 
the output value. Only then is the value 
forwarded to the actuators. 
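
One cycle of this exchange-and-agree scheme can be sketched in a few lines of Python. The sketch is illustrative only and rests on assumptions the article does not confirm: the selection rule (low median of the four exchanged readings), the majority vote on outputs, and the placeholder control law are all hypothetical stand-ins, not the SEF's actual algorithms.

```python
from collections import Counter
from statistics import median_low


def agree_on_input(readings):
    """Every computer applies the same rule to the same exchanged readings,
    so all of them process an identical input value."""
    return median_low(readings)


def control_law(x):
    """Placeholder GNC computation (hypothetical): a proportional command."""
    return round(-0.5 * x, 3)


def vote_on_output(outputs):
    """Return the value produced by a strict majority, or None if none exists."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def gnc_cycle(readings, faulty=None):
    """Read, exchange, agree, process, exchange again, and vote."""
    agreed = agree_on_input(readings)
    outputs = [control_law(agreed) for _ in readings]
    if faulty is not None:
        outputs[faulty] = 999.0  # simulate one computer producing a wrong result
    return vote_on_output(outputs)


readings = [10.0, 10.2, 9.9, 10.1]   # one reading per computer
print(gnc_cycle(readings))            # all four agree
print(gnc_cycle(readings, faulty=2))  # the wrong result is outvoted
```

Because the erroneous value is simply outvoted, the command reaches the actuators without any rollback or retry, which is the sense in which the computer masks errors instead of recovering from them.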

To offload the application processors
from synchronization and matching,
a dedicated processor, called Data
Manager, handles the four serial 
interprocessor links between the com¬ 
puters. These links, called the Interpro¬ 
cessor Network, provide a reliable 
clock synchronization with a 20-ms 
period. This new approach responded 
to recognition that synchronization is 
a critical and time-consuming function, 
especially when executing Byzantine 
agreements. 
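
As background on why a pool of four is a natural size: a classical result on Byzantine agreement (Lamport, Shostak, and Pease) states that, with unauthenticated messages, n participants can tolerate f arbitrarily faulty ones only if n is at least 3f + 1. The short Python sketch below merely evaluates that bound. Whether the SEF's fault hypothesis is exactly one Byzantine fault is an assumption here, not something the article states.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f satisfying n >= 3f + 1 (unauthenticated messages)."""
    return (n - 1) // 3


def min_computers(f: int) -> int:
    """Smallest pool size that tolerates f Byzantine faults."""
    return 3 * f + 1


for n in (3, 4, 7):
    print(n, max_byzantine_faults(n))
```

By this bound a pool of three cannot mask even one arbitrary fault, while a pool of four can mask one.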

This arrangement can still fail be¬ 
cause of common-mode errors. The 
most obvious is that the same software 
error may affect all computers. There¬ 
fore, the programs running in the dif¬ 
ferent processors may be diversified, 
for example, written by different per¬ 
sons. So it becomes unlikely that the 
same error will affect all computers. 
This technique is called N-version pro¬ 
gramming. It is used today, for instance, 
in the Airbus 320 computers. 
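
The idea behind N-version programming can be shown with a toy Python example. Everything in it is hypothetical: three independently written routines normalize a heading angle, one of them carries a deliberate bug, and a simple majority voter (an assumed design, not the SEF's) picks the result.

```python
from collections import Counter


def version_a(angle_deg: int) -> int:
    """First implementation: rely on the modulo operator."""
    return angle_deg % 360


def version_b(angle_deg: int) -> int:
    """Second implementation: repeated addition and subtraction."""
    while angle_deg >= 360:
        angle_deg -= 360
    while angle_deg < 0:
        angle_deg += 360
    return angle_deg


def version_c(angle_deg: int) -> int:
    """Third implementation, with a deliberate bug for negative angles."""
    return abs(angle_deg) % 360


def n_version_vote(versions, angle_deg):
    """Run every version and return the strict-majority result, or None."""
    results = [version(angle_deg) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None


versions = (version_a, version_b, version_c)
print(n_version_vote(versions, 450))  # all versions agree
print(n_version_vote(versions, -90))  # version_c is wrong and is outvoted
```

The voter masks the bug only because the other two versions do not share it. Diversity reduces the likelihood of a common-mode software error but does not eliminate it.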

This is also a new approach for an¬ 
other reason. Previous projects, such 
as the US space shuttle, did not fore¬ 
see software diversity or let it execute 
on a distinct computing system. A rep¬ 
resentative prototype based on func¬ 
tional models of this fault-tolerant 
computer pool architecture has already 
been developed by Matra Marconi to 
validate the concept. Final space quali¬ 
fication will be the real challenge of 
the SEF development, especially with 
respect to the tools and methods in¬ 
volved. Too often, fault-tolerant com¬ 
puters have been a pill in search of a 
disease. The SEF offers a unique opportunity
to apply the theory of fault-tolerant
computing where it is really
needed. This is one of the merits of 
the Hermes project. 

What is the future of Hermes? The 
Greeks named Hermes the god of elo¬ 
quence, trade, and thieves. The future 
will show which name really applies, 
if not all three. 

Consciousness


Regular readers of this column have seen
several reviews of books about human 
mental processes. In October 1988 I 
talked about Johnson-Laird’s frustrating The Com¬ 
puter and the Mind. Last August I reviewed 
Penrose’s The Emperor’s New Mind, a work
whose main conclusion about the inability of 
artificial intelligence (AI) to replicate human be¬ 
havior depends on an as yet undiscovered theory 
of quantum gravitation. Then last October I 
looked at Boden’s The Creative Mind, a work
that takes a contrasting view of AI and fails to 
disagree completely with Penrose only because 
of a last-minute appeal to human chauvinism. 

This time I have looked at a work that attempts 
no less a task than the complete explanation of 
consciousness. It is a serious and scholarly explo¬ 
ration of the great central problem of psychology 
and philosophy, the mind-body problem. I pre¬ 
dict that many people will buy it, but few will 
read it, and fewer still will understand and adopt 
the viewpoint that it teaches. 

Consciousness Explained, Daniel C. Dennett
(Little, Brown, Boston, 1991, 524 pp., $27.95) 
The first thing I have to tell you about this 
book is that its explanation of consciousness, by 
the author’s own admission, is sketchy. Dennett 
tries to explain the most important physiological 
and psychological facts and to answer the most 
widely known philosophical arguments that con¬ 
tradict his point of view. At many points, how¬ 
ever, he simply gives the shape of the theory, 
leaving to further research the fleshing out of 
details. As he says in his appendix for scientists, 
in which he proposes several experiments, 

Since as a philosopher I’ve tried to keep
my model as general and noncommittal as
possible, if I’ve done my job right, these
experiments should help settle only how 
strong a version of my model is con¬ 
firmed; if the model were entirely 
disconfirmed, I would be well and truly 
refuted and embarrassed. 

Dennett is a wonderful storyteller. His ability 
to make points and cut through jargon with well- 
chosen analogies, anecdotes, and caricatures is 
breathtaking. Again and again as the going gets 
rough, he finds a way to bring the discourse 
back to an arena in which the reader feels at 
home. For example, in discussing color vision, a 
favorite topic in philosophical discourses on 
human mental processes, he mentions the 
Rosenbergs’ torn Jell-O boxes. Two spies could
identify themselves to each other by producing
the torn halves of a Jell-O box. Neither half’s
pattern has any intrinsic significance, but each 
matches the other perfectly. This story cuts 
through philosophical jargon that goes back 
hundreds of years and makes clear immediately 
Dennett’s view of why there are colors. 

In The Selfish Gene (Oxford, 1976), Richard 
Dawkins coined the term meme to describe com¬ 
plex idea units (like wheel, alphabet, calculus, 
the Odyssey, the theme from the slow movement
of Beethoven’s Seventh Symphony). Memes
are central to Dennett’s view of consciousness. 
He says, 

Human consciousness is itself a huge 
complex of memes (or more exactly, 
meme-effects in brains) that can best be 
understood as the operation of a “von 
Neumannesque” virtual machine imple¬ 
mented in the parallel architecture of a 
brain that was not designed for any such 
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activities. The powers of this 
virtual machine vastly enhance 
the underlying powers of the 
organic hardware on which it 
runs. But at the same time 
many of its most curious fea¬ 
tures, and especially its limita¬ 
tions, can be explained as the 
by-products of the kludges that 
make possible this curious but 
effective reuse of an existing 
organ for novel purposes. 

This certainly sounds like strong AI,
the doctrine so vehemently opposed by
Penrose. If you allow this definition to
stand, you have to accept conscious
machines or, alternatively, Dennett’s
assertion that we’re all zombies. That
is, Dennett believes there is no consciousness
of the mysterious (epiphenomenal)
sort posited by people who
say that human beings must be more
than “mere” Turing machines, no matter
how closely machines can simulate
their behavior. These are views, says
Dennett, that do not deserve to be discussed
with a straight face.

One of the most important things to
learn from Dennett’s book is how to
apply his highly counterintuitive point
of view. Most people who introspect
about their minds tend to picture a
control room in which a self gathers
together sensory inputs and remembered
information. The self uses these
to make decisions and issue commands
to the mechanisms that control speech,
movement, and so forth. For example,
in speaking of the effect of the “blind
spot” where the optic nerve passes
through the retina, many writers say
that the brain fills in the missing part
of the field of view. This view makes
no sense unless there is an internal
projection of the visual field and an
internal observer viewing that projection
in the control room. Descartes
placed the control room in the pineal
gland, but no serious modern thinker
believes the control room model corresponds
to actual brain function.

I don’t want to give a detailed explanation
of Dennett’s replacement for
this model. To me, his view of a person
is like a roomful of monkeys at
typewriters putting out a single newsletter.
He regards the self as a construct,
like a center of gravity, whose usefulness
breaks down if you get too close. In
fact, he uses the term “narrative center
of gravity.”

While most thinkers reject the Cartesian
control room model, many let it
enter implicitly into their arguments,
especially in “thought experiments.”
Thought experiments are parables, like
Searle’s Chinese Room (see Micro Review,
August 1991). The philosopher
devising the thought experiment asks
you to imagine a situation that is possible
in principle but usually impossible
in practice. Then you are asked
to follow a handwaving argument that
leads to the point the philosopher is
trying to make. Dennett ridicules a few
notorious thought experiments. These
exercises are good examples of the
application of his model.

This review of Dennett’s densely 
packed 500 pages is necessarily sketchy 
and incomplete, and it may not give 
much inkling of the excitement I felt 
while reading it. If you are interested 
in this subject, you should invest the 
time necessary to read and understand 
Dennett's book. 

Macintosh utilities 

Every year after the MacWorld Expo 
in San Francisco in January, I receive a 
large number of Macintosh programs 
to review. This year there seemed to 
be a better selection of utility programs 
than I’ve seen in previous years. 

Now Utilities, Version 3.0 (Now Software,
520 S.W. Harrison St., Suite 435,
Portland, OR 97201; (503) 274-2800,
$129)

The Now Utilities is a package of 10
programs. The company has tried to
cover all the bases, but other manufacturers
provide better products for
some of the functions. I think the best
parts of the package are Now
Menus and Super Boomerang.

Now Menus allows cascading up to
five levels. This is ideal for use with
the Apple menu under System 7, since
the most natural way to organize Apple
menu items is in nested folders. Super
Boomerang remembers applications
and documents that you have used
recently and makes them instantly available
by slightly modifying the operation
of the file-selection dialogs used
by all Macintosh application programs.

WYSIWYG Menus is another useful
program. It causes each entry in a font-selection
menu to appear in characters
from the font named by the entry. Of
course, this program has a few drawbacks.
For example, the names of fonts
like Symbol or Zapf Dingbats are rendered
in Greek letters or in meaningless
pictures.

Startup Manager lets you determine 
which startup programs to use and in 
which order. This program can be very 
helpful in debugging startup conflicts. 

The other programs of the Now Utilities
are also useful, but you don’t have
to use them. Each of the utilities can
be installed separately. Even if you only
use a few of them, you’ll still get your
money’s worth.

Alsoft Power Utilities (Alsoft, Inc., PO
Box 927, Spring, TX 77383-0927; (713)
353-4090, $129)

Alsoft’s package of seven utilities is
more narrowly focused than Now’s.
Four of the utilities pertain to disk operation.
The others are a menu utility
that is similar to Now Menus, a screen
dimmer, and the partially obsolete (for
System 7 users) Master Juggler.

Disk Express II keeps the files of your
hard disk stored as efficiently as possible.
It reorganizes files on demand
or once per day in the background. It
also removes fragmentation and keeps
frequently used files together. It performs
its job one file at a time, so that
it is interruptible and little damage is
done if it crashes.

continued on p. 79 


April 1992 7 














Guest Editor’s Introduction 

Hot Chips III 


Norman P. Jouppi 

Digital Equipment 
Corporation 



The annual Hot Chips Symposium presents
the most current and exciting
chip developments, as well as work
in progress. The recent third meeting
again boasted record attendance and required
moving to the largest auditorium at Stanford University.
The authors of seven of the most interesting
and technically solid presentations were
invited to submit papers for this special issue of
IEEE Micro. Six authors agreed to submit papers,
three were able to, and two appear here. In addition,
this issue also carries an article detailing a
recently developed and indisputably “hot” chip
that was not presented at the symposium.

A theme running through the three special issue
articles is that of exploiting parallelism for
higher performance. Each product exploits parallelism
in a different way.

Authors of the first article explain how the Mips
R4000 exploits instruction-level parallelism
through superpipelining. Superpipelining refers
to the further pipelining of what are normally
fundamental single-cycle operations in a pipelined
machine. For example, on-chip cache access
usually occurs in one cycle in pipelined
microprocessors (not counting the usual cycle for
address calculation). In contrast, the R4000 cache
access is pipelined into three stages: two for cache
access and one for tag comparison and control.
The R4000 also provides support for a moderate
degree of multiprocessing.

Next is a message-driven processor used in the
J-machine at MIT. The message-driven processor
exploits parallelism through large-scale and fine-grain
multiprocessing. Each processor can deliver
a message and dispatch a task to handle it with a
latency of under two microseconds. In comparison,
this is within an order of magnitude of the
time required for a main memory access in most
computers. The architecture and hardware design
of the J-machine support up to 4,096
processors!

The third article describes the 88110 from Motorola.
This microprocessor makes use of instruction-level
parallelism by issuing multiple
independent instructions in the same cycle (that
is, a superscalar approach). The 88110 adds graphics
support and a floating-point register file to
the 88000 architecture. Many organizational features
add to performance, such as the out-of-order
issue of stores, nonblocking caches, and 10 independent
functional units. All but one of the functional
units can begin a new operation each cycle.
The 88110 also provides support for a moderate
degree of multiprocessing.

Hot Chips IV is already in the planning stages.
It takes place August 9-11, 1992, at Stanford University.
If you’d like to submit a presentation, contact
Bob Miller at (510) 642-6037 (bmiller@ginger.berkeley.edu).
If you’d like more information
about the 1992 symposium, contact Glen Langdon
at (408) 459-2212 (langdon@cse.ucsc.edu).

I thank those reviewers who helped referee
submissions with an amazing one- or two-week
turnaround, and all the other people who helped
with this issue.
Call for Articles

Advanced Packaging and 
Interconnect Technology 

April 1993 Special Issue 

IEEE Micro 

The April 1993 special issue of IEEE Micro will feature
articles on advanced packaging and interconnect
technology, as a companion issue to the April
1993 Computer special issue on multichip modules.
The guest editor solicits manuscripts in the areas of

• critical packaging trends and issues; 

• substrate and package technologies—for example,
flexible, glass, or diamond substrates,
few-chip packaging, or 3D packaging;

• attachment, bonding, and connection technologies,
including fine-pitch surface mount,
laser applications, known-good die, and interconnection
trade-off analysis;

• system-level issues such as test, performance 
modeling, cooling, and EMI; and 

• materials technology. 

Authors should submit six copies of an original manuscript
by July 1, 1992; notification of decisions is set for
October 1, 1992; and the deadline for submission of the
final version of each manuscript is December 1, 1992.
For author guidelines, contact Clair Azada, Computer
Society West Coast Office, (714) 821-8380.


Direct submissions and questions to Guest Editor 
David Misunas, MCC, 12100 Technology Blvd., Austin, 
TX 78727, phone (512) 250-3045, fax (512) 250-3045. 
The Mips R4000 Processor


Computer architects estimate that the current generation of 32-bit machines will be obsolete
by 1997. The R4000 employs a 64-bit architecture, using 64-bit registers and generating 64-bit
virtual addresses. Superpipelining techniques allow it to process more instructions simultaneously
than the previous generation of microprocessors. Specmark ratings indicate that it
outperforms other single-chip microprocessors.


Sunil Mirapuri 


Michael Woodacre


Nader Vasseghi 


Mips Computer Systems 


The R4000 is a highly integrated, 64-bit
RISC microprocessor that provides a
simple solution to the increasing demands
on the size of address space,
while maintaining full compatibility with previous
Mips processors. Its primary features include



• on-chip CPU, FPU, MMU, primary caches,
and system interface logic (see Figure 1), 1

• superpipelining techniques, 

• on-chip secondary cache control logic with 
a flexible interface, 

• a programmable system interface for high-performance
multiprocessor servers and low-cost
desktop systems,

• flexible multiprocessor support, and 

• 1.2 million transistors implemented in CMOS 
technology. 


In addition, the R4000’s single-chip implementation
makes it easier to scale the clock as technology
improves. According to SPEC benchmark
tests, it achieves the highest performance of any
microprocessor chip.


A 64-bit architecture 

With programs growing by one-half to one bit
of address space per year, 2 a greater than 32-bit
address space should be useful by 1993 and required
by 1997. In creating the 64-bit R4000, designers
extended the R3000 architecture by
increasing the data word size and virtual address
space. This design entailed widening the machine
registers and data paths, and sign-extending 32-bit
data when loading into registers. Since certain
operations work differently on 64-bit data than
on sign-extended 32-bit data, we added
instructions for 64-bit data, including integer loads,
stores, adds, subtracts, shifts, multiplies, divides,
and coprocessor moves.

The chip also supports a 64-bit virtual address
space with wide virtual address data paths. It
stores 32-bit addresses as 64-bit entities in sign-extended
form and stores the results of address
computation on these entities in sign-extended
form. Thus it continues to support the previous
32-bit architecture’s addressing. 3
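The sign-extension convention can be illustrated with a minimal Python sketch. The function names below are hypothetical and the code models only the arithmetic, not the R4000 data path: a 32-bit value is widened by copying bit 31 into bits 32 through 63, and an address computed from such a value is still in sign-extended form as long as the result stays inside the same 32-bit region.

```python
MASK64 = (1 << 64) - 1


def sign_extend_32_to_64(value32: int) -> int:
    """Return the 64-bit register image of a 32-bit two's-complement value."""
    value32 &= 0xFFFFFFFF
    if value32 & 0x80000000:          # bit 31 set: fill the upper half with ones
        return value32 | 0xFFFFFFFF00000000
    return value32                     # bit 31 clear: upper half stays zero


def add_address(base64: int, offset: int) -> int:
    """64-bit address addition, wrapping modulo 2**64."""
    return (base64 + offset) & MASK64


def is_sign_extended_32(value64: int) -> bool:
    """True if the 64-bit value equals the sign extension of its low 32 bits."""
    return sign_extend_32_to_64(value64 & 0xFFFFFFFF) == value64


base = sign_extend_32_to_64(0x80001000)    # an address in the upper half of the 32-bit space
target = add_address(base, 0x10)
print(hex(base), hex(target), is_sign_extended_32(target))
```

Old 32-bit addresses therefore map onto a fixed subset of the 64-bit space, which is one way to read the claim that the earlier addressing model continues to work.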

The hardware cost of extending the architecture
to 64 bits was about 7 percent of the die
area. The longer, 64-bit ALU stage accounts
for the cycle-time penalty.

CPU pipeline 

The R4000’s eight pipeline stages allow
it to process more instructions at once
than can the R3000’s five-stage pipeline. 4
Superpipelining has split the instruction
and data memory references across two
stages. Consequently, we could distribute
the logic more evenly across pipeline
stages. (See Figure 2.) The
single-cycle ALU stage takes slightly more
time than each of the cache access stages.

Although the superpipeline increases
the cycles per instruction due to longer
branch and load delays, it greatly improves
the achievable cycle time. Future
increases in cache size will not
require a fundamental redesign of the
superpipeline. We considered superscalar
design as another way to increase
instruction-level parallelism, but our
studies showed that with current technology
the chip could achieve higher
performance with a less complex superpipeline.

Figure 3 on the next page shows optimal
pipeline movement, completing
one instruction every internal clock
cycle. The internal, or pipeline, clock
rate of the R4000 is twice the external
input, or master, clock frequency.

The processor accesses the instruction
cache during the instruction first
(IF) and instruction second (IS) stages,
with a new cache access starting every
cycle. The MMU translates the instruction
virtual address into a physical address
during these stages. The
instruction bits available at the beginning
of the register file (RF) stage are
decoded and used to access the register
file. Also at this time, the tags read
from the instruction cache are compared
with the physical address to determine
whether the instruction cache
access was a hit. If so, the instruction
can advance to its execution (EX) stage.
For nonmemory operations, the
instruction’s result is available by the
end of the EX stage.

In the data first (DF) and data second (DS) stages, the R4000 accesses
the data cache, with a new access starting every cycle. The
MMU translates the data virtual address into a physical address
during these stages. In the tag check (TC) stage, the
R4000 compares the data tags from the cache tag array with
the translated address to determine if the data cache access
was a hit. For stores, if the tag check passes in TC, the data
travel to the store buffer and enter the data cache the next
time cache bandwidth is available. Instructions finally go to
the write back (WB) stage where the data are written to the
register file if necessary.

Figure 1. R4000 internal block diagram. (Legend: SC, system control; DVA, data virtual address; IVA, instruction virtual address; FP, floating point.)

Figure 2. R4000 pipeline activities.

Figure 3. R4000 pipeline and instruction overlapping.

Load interlocks and branch instructions disrupt the normal
flow of the pipeline. For loads, the data are not ready until
the end of the cache access in the DS stage. If either of the two
instructions after a load uses the result of the load in its EX
stage, the hardware interlocks and slips. As shown in Figure
4, during the slip the DF, DS, TC, and WB stages of the pipeline
advance while the IF, IS, RF, and EX stages do not. For the load
interlock, this permits the load instruction to advance and
complete its cache access, while the instruction that depends
on the load remains in the EX stage.

Figure 4. Load interlock/slip cycle.
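The number of slip cycles follows from the stage timing just described. The sketch below is only an illustration under two assumptions that the text implies but does not spell out: load data becomes usable by an EX stage in the cycle after the load's DS stage, and each instruction normally trails its predecessor by one stage. The function name is hypothetical.

```python
def load_use_slips(distance: int) -> int:
    """Slip cycles needed when the instruction `distance` slots after a load
    consumes the loaded value in its EX stage.

    The load passes through EX, DF, and DS before its data is ready, so a
    consumer must reach EX at least three cycles after the load did.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    load_to_use = 3                     # EX -> DF -> DS, data usable the cycle after DS
    return max(0, load_to_use - distance)


for d in (1, 2, 3):
    print(f"consumer {d} instruction(s) after the load: {load_use_slips(d)} slip cycle(s)")
```

Under these assumptions, only the two instructions directly behind a load can slip, which matches the interlock condition stated above.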

The results of a branch condition check and a branch target
address calculation are not known until the end of the EX
stage. (See Figure 5.) By that time, up to three subsequent
instructions have entered the pipeline. If the branch is not
taken, the processor can continue to execute all instructions
that have entered the pipeline with no penalty. If the branch is
taken, the processor accesses instructions at the branch target
address. For taken branches, the Mips architecture allows one
instruction after the branch to complete before executing
the branch target instruction. The other two instructions
that have already entered the pipeline are nullified.
We considered a branch target scheme that prefetches
instructions from both paths of a branch, producing a
smaller branch penalty. However, implementation constraints
required the simpler approach without a
prefetching scheme.
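As a rough illustration of what two nullified instructions per taken branch cost, the following sketch estimates the average number of cycles lost per instruction. The workload fractions are hypothetical inputs, not measurements from the R4000, and the model ignores delay slots that a compiler cannot fill with useful work.

```python
def taken_branch_overhead(branch_fraction: float,
                          taken_fraction: float,
                          nullified_per_taken_branch: int = 2) -> float:
    """Average extra cycles per instruction lost to nullified instructions.

    branch_fraction: share of executed instructions that are branches.
    taken_fraction:  share of those branches that are taken.
    """
    if not (0.0 <= branch_fraction <= 1.0 and 0.0 <= taken_fraction <= 1.0):
        raise ValueError("fractions must lie between 0 and 1")
    return branch_fraction * taken_fraction * nullified_per_taken_branch


# Hypothetical workload: 20 percent branches, half of them taken.
print(taken_branch_overhead(0.20, 0.50))
```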

Results of instructions that have completed their execution,
but have not yet written their results into the
register file, may be bypassed as operands for subsequent
instructions.

Integer data path 

The R4000’s 64-bit execution unit includes a 64-bit
register file, load aligner, ALU, shifter, multiplier, and
divider. The 64-bit data path supports extended addressing
without the use of long pointers or segment
registers. 5

The ALU stage, EX, was a speed-critical path. To shorten
the cycle time, the ALU comprises an adder and a logical
unit. The 64-bit, carry-select adder manipulates all 32-bit operands
as sign-extended, 64-bit operands. It also performs
address calculations for loads, stores, and branches, and is
used in integer multiply and divide.

The R4000 provides hardware support for integer multiply
and divide. It uses a 2-bit Booth algorithm for integer
multiplication and breaks each iteration into four
stages: Booth decoding, multiplicand selection, partial
product generation, and product accumulation. The
carry-save adder (CSA) adds intermediate partial products,
and two separate 64-bit registers Hi and Lo store
the final product.
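A 2-bit Booth scheme is commonly realized as radix-4 Booth recoding, and the sketch below assumes that reading. It is a functional model in Python with hypothetical function names, not a description of the R4000 circuit. Each iteration examines two new multiplier bits plus the previous high bit and selects a multiple of the multiplicand from -2, -1, 0, 1, and 2.

```python
def booth_radix4_digits(multiplier: int, width: int) -> list[int]:
    """Recode a two's-complement multiplier of `width` bits into radix-4
    Booth digits in {-2, -1, 0, 1, 2}, least significant digit first."""
    if width % 2 != 0:
        raise ValueError("width must be even")
    m = multiplier & ((1 << width) - 1)
    digits = []
    prev = 0                                 # implicit bit to the right of bit 0
    for i in range(0, width, 2):
        b0 = (m >> i) & 1
        b1 = (m >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)    # Booth decoding of the bit triplet
        prev = b1
    return digits


def booth_multiply(multiplicand: int, multiplier: int, width: int = 32) -> int:
    """Signed product; `multiplier` must fit in `width` two's-complement bits."""
    product = 0
    for k, digit in enumerate(booth_radix4_digits(multiplier, width)):
        partial = digit * multiplicand       # multiplicand selection
        product += partial << (2 * k)        # partial-product accumulation
    return product


print(booth_radix4_digits(6, 4), booth_multiply(7, -3))
```

With two bits retired per iteration, a 32-bit multiplier needs 16 iterations in this model; how those iterations map onto the stated cycle counts is not something the sketch attempts to reproduce.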

The multiplier cycles at twice the pipeline clock frequency
to produce two sums for each pipeline cycle. Since the R4000
uses a CSA, the multiply results are in a sum-and-carry form
and must be combined through full carry propagation. The
integer ALU performs this operation when the result moves
to the general registers. Integer multiply latency is 10 pipeline
cycles for 32-bit operations and 20 pipeline cycles for 64-bit
operations.
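The sum-and-carry form can be made concrete with a small model of a carry-save adder. This is a generic 3:2 compressor written in Python for illustration, with hypothetical function names, restricted here to nonnegative operands. Carries are never propagated during accumulation, and a single carry-propagating addition at the end produces the ordinary binary result.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 compression: return (sum_word, carry_word) such that
    sum_word + carry_word == x + y + z, without propagating any carry."""
    sum_word = x ^ y ^ z                              # per-bit sum
    carry_word = ((x & y) | (x & z) | (y & z)) << 1   # per-bit carry, shifted into place
    return sum_word, carry_word


def accumulate(partial_products: list[int]) -> int:
    """Add partial products in carry-save form, then resolve once."""
    sum_word, carry_word = 0, 0
    for partial in partial_products:
        sum_word, carry_word = carry_save_add(sum_word, carry_word, partial)
    return sum_word + carry_word                      # the one full carry propagation


print(carry_save_add(5, 6, 7), accumulate([3, 5, 7, 9]))
```

The final addition in `accumulate` corresponds to the step the text assigns to the integer ALU.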

Divides use a 1-bit-per-iteration, nonrestoring algorithm.
This algorithm leaves the quotient in a signed-digit form
that must be converted back to a binary representation and
possibly corrected at the end of the divide. Divides use the
main integer adder for the remainder add or subtract operations,
thus preventing the instructions from entering the pipeline
during a divide. The implementation takes two pipeline
cycles per iteration; each iteration resolves 1 bit of dividend.
The latencies are 69 pipeline cycles for a 32-bit divide
and 133 pipeline cycles for a 64-bit divide operation.
We found this performance sufficient, due to the infrequent
occurrence of the integer divide operations.
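The following Python sketch shows a textbook nonrestoring divider for unsigned operands, offered as an illustration of the technique rather than the R4000 logic; the function name and the unsigned restriction are assumptions. Each iteration either subtracts or adds the divisor depending on the sign of the partial remainder and records a quotient digit of +1 or -1. The signed digits are converted to binary at the end, and a final correction fixes a negative remainder.

```python
def nonrestoring_divide(dividend: int, divisor: int, width: int = 32) -> tuple[int, int]:
    """Unsigned nonrestoring division; returns (quotient, remainder)."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend out of range for width")

    remainder = 0
    digits = []                                   # signed quotient digits, MSB first
    for i in range(width - 1, -1, -1):
        bit = (dividend >> i) & 1
        if remainder >= 0:
            remainder = 2 * remainder + bit - divisor
            digits.append(1)
        else:
            remainder = 2 * remainder + bit + divisor
            digits.append(-1)

    quotient = 0
    for digit in digits:                          # signed-digit to binary conversion
        quotient = 2 * quotient + digit

    if remainder < 0:                             # final correction step
        remainder += divisor
        quotient -= 1
    return quotient, remainder


print(nonrestoring_divide(100, 7))
```

At two pipeline cycles per iteration, 32 and 64 iterations account for 64 and 128 cycles, so the stated latencies of 69 and 133 cycles are consistent with about five further cycles of setup, conversion, and correction. The text does not break that overhead down, so this split is only an inference.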

The integer shifter performs immediate or variable shifts from
zero to 63 places. We designed the shifter to shift up to 32 bits in
one cycle, making it half the size of a 64-bit shifter. To accomplish
shifts greater than 32 bits, the pipeline slips for one cycle
while forcing a 32-bit shift in the EX cycle. In the next cycle, the
shifter performs the remainder of the shift. A trade-off between
area and performance led to this decision.
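A minimal model of the two-pass shift is sketched below, assuming a logical left shift with an immediate shift amount; the function name and the returned cycle count are illustrative conventions, not part of the R4000 definition. Amounts up to 32 complete in one pass. Larger amounts are split into a forced 32-place shift followed by the remainder.

```python
MASK64 = (1 << 64) - 1


def shift_left_64(value: int, amount: int) -> tuple[int, int]:
    """Return (result, cycles) for a 64-bit logical left shift performed by a
    shifter that can move at most 32 places per pass."""
    if not 0 <= amount <= 63:
        raise ValueError("shift amount must be between 0 and 63")
    value &= MASK64
    if amount <= 32:
        return (value << amount) & MASK64, 1
    partial = (value << 32) & MASK64               # forced 32-place shift during the slip
    return (partial << (amount - 32)) & MASK64, 2  # remainder of the shift


print(shift_left_64(1, 32), shift_left_64(1, 40))
```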

The register file is a 32-entry by 64-bit array with two read 
ports and one write port. It can read and write in the same 
cycle. In the case of reading and writing the same location in 
the same cycle, the R4000 provides local bypassing of the 
write data into the read bus. 

Floating-point unit 

The FPU implements IEEE Std 754-1985. 6 Its three
functional units—multiplier, adder, and divider—operate on
single- and double-precision operands. While the FPU executes
a multicycle operation, the CPU pipeline can continue
in parallel until the FPU detects a data or resource
dependency. It can transfer data directly to or from the CPU
or cache memory. The FPU executes up to three instructions
concurrently, one per functional unit. It retires only
one instruction per cycle. 7

Figure 5. Branch delay.

The floating-point multiplier (see Figure 6 on next page)
uses a modified Booth algorithm that scans four overlapping
groups of 3 bits at once. Thus 8 bits of the multiplier operand
can retire with each iteration. The mantissa portion of the
multiply array uses four CSAs in a pipeline fashion. The multiplier
pipeline includes four stages:

• Booth encoding and multiplicand selection, 

• partial sum-and-carry generation of selected multiplicands,

• partial product summation of the previous stage result 
with the previous iteration result, and 

• guard, round, and sticky-bit generation. 

In the cycle following the last iteration of the multiply, the
sum and carry from the multiplier array travel to the floating-point
adder to produce the final rounded product.

The multiplier cycles at twice the pipeline clock frequency,
so each iteration through the multiplier takes only half a pipeline
cycle. The R4000’s high-speed operation demands that the
multiplier array use a two-phase design approach. To reduce
the clock skew in this region, the multiplier uses stronger
clock drivers (with lower fanout). These drivers allow more
aggressive latch designs with improved set-up times, and thus
reduce overhead. All CSA and Booth multiplexers use dynamic
logic design due to speed criticality.

The floating-point multiply latency is seven pipeline cycles 
for single-precision and eight for double-precision operations. 
The repeat rate is three pipeline cycles for single precision and 
four for double precision. 
Figure 6. Block diagram of the floating-point multiplier.


The floating-point adder (Figure 7) processes one add or subtract
in four pipeline cycles and starts a new operation every
three pipeline cycles for both single- and double-precision operations.
The adder also assists the multiplier and divider with
cleanup operations, such as rounding and final result
computation.

To provide necessary bandwidth to support a two-staged,
pipelined multiplier (as seen by the adder), we designed the
adder to process a pair of double-precision, multiply-and-add
instructions every four cycles.

The adder comprises four stages: 

• unpack, 

• mantissa add, 

• result rounding, and 

• mantissa shift (alignment/normalize). 

The adder has two data entry paths. One accommodates the
normal source operands that go
through the unpack stage to form data
inputs for all adder-supported operations.
The multiplier/divider units send
their intermediate results on the other
path to the adder’s input stage for final
computation. No new instructions
can enter the pipeline while the intermediate
result travels from multiplier
or divider to the adder for the cleanup
cycles. The one data repacker in the
FPU packs the final result produced
by the adder to the correct data format.

We based the floating-point divide
operation on the SRT divide algorithm, 8
which selects the quotient digit based
on an estimation of the partial remainder.
This technique has the advantage
of not requiring a full-precision adder
to add or subtract the partial remainder
with a divisor multiple. Therefore it runs
faster. The latency and repeat rates for
floating-point divide operations are 23
and 22 cycles for single-precision operations
and 36 and 35 cycles for double-precision
operations. (See Table 1.)

The adder calculates square root by
generating 1 root bit per cycle using
the SRT algorithm. Since the adder also
supports multiply and divide instructions,
no new computational instruction
may start while it calculates a
square root. The square-root latency
is 34 and 112 cycles for single- and
double-precision operations.

Designers equipped the floating-point divider and the multiplier
units with features that allow the circuit to power down at
the end of every operation by recirculating zeros in the unit.

The floating-point register file is a 32-entry by 64-bit array 
with two read ports and two write ports. We dedicated one of 
the write ports for FP computational result writebacks and the 
other for FP load, store, and move instructions. In the case of 
reading and writing the same location in the same cycle, the 
register file locally bypasses the write data onto the read buses. 

Stalls, slips, and exceptions 

Pipeline hazards interrupt smooth pipeline flow (Figure
2), causing stalls, slips, or exceptions. In stall cycles, the pipeline
does not advance. When the R4000 processes the stall, it
restarts the pipeline and reissues several instructions to generate
correct results.

For slips, such as the load interlocks detailed earlier, only
the DF, DS, TC, and WB stages advance
while the IF, IS, RF, and EX stages do
not. When the slip condition is resolved,
the instructions in the pipeline resume
from whatever stage they are in. For exceptions,
the processor suspends the normal
sequence of instruction execution
and transfers control to an exception
handler, detailed later.

Figure 8 on the next page shows how
the entire pipeline stalls for a data cache
miss on load instruction 1. Since the load
miss processing takes several cycles, the
pipeline stalls until the secondary cache
and main memory access completes. Note
that before the stall, instruction
4 may have used erroneous data in
its EX stage that was bypassed from the
load instruction. During the restart sequence,
the processor repeats the EX stage
for instruction 4 to obtain the correct data
from the load operation. The different
stall types include

• Data cache miss, detected by the 
data tag check 

• Data first stage stalls, which can occur
for three mutually exclusive
groups of instructions: 1) The pipeline
stalls to resolve whether the FP
instruction will cause an exception
before moving on, to guarantee precise
exceptions. 2) The pipeline stalls
to let the instruction sign-extend the
result. 3) The pipeline stalls to let the
store buffer entries retire to memory
because control logic has detected a
load to the same memory location.

• Instruction cache miss, detected by 
the instruction tag check 

• Instruction translation look-aside 
buffer stalls, for instruction TLB 
misses (explained in detail later) 

• Multiprocessor, generated by requests from other 
processors 

Slips occur when the result of an instruction is not available
until the DS stage of an instruction, as occurs with loads.
Floating-point instructions interlocked for resources also cause
slips, as do integer instructions waiting for an integer multiply
or divide operation to complete. Variable shifts and shifts
greater than 32 bits also use slips since these operations take
two cycles to complete.



Figure 7. Adder logical block diagram.


Table 1. Integer and floating-point operation
latencies and repeat rates in pipeline cycles.

                     Integer              Floating point
                                        Latency      Repeat
Operation        32 bits  64 bits      SP    DP     SP    DP

Add/subtract         1        1         4     4      3     3
Multiply            10       20         7     8      3     4
Divide              69      133        23    36     22    35
Figure 8. ADD data cache miss, use of load. STL indicates a stall.

Figure 9. Circuit pipelining.

The R4000 processes many stalls and slips simultaneously.
By slipping on instructions that need the same resources as a
multicycle floating-point instruction, it can simultaneously
accept other stall conditions from instructions that continue
to advance further down the pipeline. Also, multiprocessor-initiated
stalls, which can stall the pipeline to examine the
cache, occur simultaneously with DCT, DFT, and ICT stalls
described above.

Stall and slip implementation. The state machines that
control pipeline flow (run, slip, and restart machines) operate
in a pipelined fashion. When logic detects a stall or slip
condition in a given cycle, the earliest the R4000 can process
this condition is the end of the next cycle.

Figure 9 shows a sample timing diagram. In the first phase,
the pipeline control unit evaluates logic that may generate a stall
or slip condition. In phase 2 and the second phase 1, the state
machines are resolved. Finally, the pipeline control signals are
distributed throughout the chip during the second phase 2.

After processing a stall, the R4000 initiates a two-cycle restart
sequence before the pipeline can run again. During this
sequence, it reevaluates portions of the pipeline with corrected
information before normal pipeline flow resumes. As
shown in Figure 8, it repeats three activities: data memory
access, execution, and instruction issuance.

Exception handling. The R4000 processes exceptions
from sources in different pipeline stages. It prioritizes incoming
exceptions and gives highest priority to the faulting instruction
furthest along the pipeline. Table 2 lists different
exceptions and the stages where they are signaled.
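The priority rule can be sketched as a selection over pending exceptions ordered by pipeline stage. The Python below is illustrative only: the data representation and function name are assumptions, and the exception names in the example are taken from Table 2.

```python
from typing import Optional

# Pipeline stages from front to back, as named in the article.
PIPELINE_ORDER = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def select_exception(pending: list[tuple[str, str]]) -> Optional[tuple[str, str]]:
    """Pick the pending (stage, exception) pair whose instruction is furthest
    along the pipeline; return None if nothing is pending."""
    if not pending:
        return None
    return max(pending, key=lambda item: PIPELINE_ORDER.index(item[0]))


pending = [("RF", "Instruction translation"), ("DS", "Overflow"), ("EX", "Syscall")]
print(select_exception(pending))
```

Serving the oldest faulting instruction first is what keeps exceptions in program order, since every instruction behind it can then be nullified.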

During normal processing, the R4000 nullifies pipeline 
stages for three reasons. 

• When an exception occurs, it nullifies instructions after 
the faulting instruction. 

• It nullifies certain instructions in branch delay slots when 
a branch is taken. 

• When the pipeline slips, it creates a nullified instruction 
“bubble,” as the back end of the pipeline advances and 
the front end does not. 

After being nullified, the instruction does not commit to 
any state. For performance, the processor inhibits any stalls 
signalled by the instruction. For example, if an instruction 
will cause a data translation exception, which is detected at 
the end of the DS stage, the processor will not allow it to 
signal a cache miss in the TC stage. 

Memory management unit 

The MMU translates virtual addresses into physical addresses
using an on-chip translation look-aside buffer (TLB). It manages
exceptions, controls the cache subsystem, and provides
diagnostic and error recovery facilities. Compared to the R3000,
the R4000 MMU provides enhanced operating system support
including increased TLB entries, variable page sizes, 64-bit
architecture support, supervisor privilege level, timer
interrupts, and a physical address trap.

We wanted to increase the number of entries in the TLB 
over the 64 entries available in the R3000 since this boosts 
performance in a wide range of applications. Using 128 entries 
required too much area for the fully associative lookup circuit. 
Therefore, we implemented a 48-entry TLB with each entry 
mapping two consecutive pages and producing 96 effective 
entries. The TLB is superpipelined (across the DF/DS
pipeline stages) and runs in parallel with the cache access.

The instruction translation look-aside buffer (ITLB) is a 
two-entry, fully associative translation buffer that is a subset 
of the main TLB. This ITLB supports only a 4-Kbyte page 
size, to reduce complexity with minimum performance im¬ 
pact. When an instruction miss occurs in the instruction buffer, 
the pipeline stalls and the main TLB refills the ITLB. When a 
branch is taken into a different page, the branch target in¬ 
struction address translation uses the TLB bandwidth avail¬ 
able during the data first and data second stages of the branch 
instruction. Since the instruction first and instruction second 
stages of the branch target line up with the data first and data 
second stages of the branch instruction, the target address 
translation refills the ITLB without stalling the pipeline. 

The R4000 implements variable page sizes on a per-page 
basis, varying from 4 Kbytes to 16 Mbytes. This helps to re¬ 
duce thrashing of the TLB in some cases, such as in the use 
of a frame buffer which uses large data blocks. It implements
variable page sizes by having a mask associated with each
TLB entry. When addresses approach the TLB for translation,
the corresponding mask bits in the TLB specify which virtual
address bits participate in the comparison and translation.

Table 2. Exceptions.

Cycles  Exceptions
IF      -
IS      -
RF      Instruction translation
EX      Interrupt
        Bus error instruction
        Illegal instruction
        Breakpoint
        Syscall
        Coprocessor unusable
        ECC instruction
        Virtual coherency instruction
DF      -
DS      Overflow
        Floating point
TC      TLB modified
        Data translation
WB      Bus error data
        Virtual coherency data
        Watch
        NMI
        Reset
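The mask-based comparison for variable page sizes can be illustrated with a small Python sketch. This is a simplified model and not the R4000's actual entry format: the names TlbEntry and translate are hypothetical, and the sketch ignores the paired pages per entry and any address-space identifiers.

```python
from dataclasses import dataclass


@dataclass
class TlbEntry:
    vaddr_base: int  # virtual base address of the page (page aligned)
    paddr_base: int  # physical base address of the page (page aligned)
    page_size: int   # power of two, 4 Kbytes to 16 Mbytes per the text

    @property
    def offset_mask(self) -> int:
        # Bits covered by the mask are left out of the comparison and
        # pass through untranslated as the page offset.
        return self.page_size - 1


def translate(tlb, vaddr):
    """Return the physical address for vaddr, or None on a TLB miss."""
    for entry in tlb:
        mask = entry.offset_mask
        if (vaddr & ~mask) == entry.vaddr_base:
            return entry.paddr_base | (vaddr & mask)
    return None


# Illustrative entries only: one 4-Kbyte page and one 16-Mbyte page.
tlb = [
    TlbEntry(vaddr_base=0x0040_0000, paddr_base=0x0120_0000, page_size=4 << 10),
    TlbEntry(vaddr_base=0x1000_0000, paddr_base=0x0800_0000, page_size=16 << 20),
]
```

With the 16-Mbyte entry, the low 24 address bits bypass the comparison, so a single entry covers what would otherwise need 4,096 entries of 4 Kbytes each.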

The R4000 instruction set architecture supports 64-bit ad¬ 
dressing. The current revision of the R4000 uses 40 bits of the 
64-bit virtual address space. Increasing the effective virtual 
address size above 40 bits would have made the TLB wider 
than the data path and difficult to fit into the layout. Hard¬ 
ware explicitly checks the unused upper bits (bits 61:40) of 
the virtual address to make sure they are zero, ensuring a 
smooth transition for software as the size of the virtual ad¬ 
dress grows in future revisions. The R4000 supports a physi¬ 
cal address of 36 bits. 

The unit includes a supervisor privilege level of operation, 
in addition to the kernel and user levels present in previous 
company designs. This mode improves operating system sup¬ 
port with more privilege levels. 

A CACHE instruction provides a set of operations allowing 
the implementation of both a high-performance, symmetric, 
multiprocessing operating system and a high-performance 
workstation operating system. This instruction makes some 
tasks more efficient, including block copy, page zeroing, cache 
initialization, page flushing, and cache testing. 

The CACHE instruction supports a number of operations 
including 
• load and store of cache tags,

• selective invalidation of cache lines,

• creation of dirty exclusive data cache lines, and

• forced writeback of lines. 

The R4000 provides a physical address trap feature for debug¬ 
ging software. This takes an exception on a reference to a se¬ 
lected physical address, which is specified in the Watch register. 

The Count and Compare registers implement a timer inter¬ 
rupt service. The Count register acts as a timer, incrementing at 
half the pipeline clock rate. When the value in the Count register
equals the value in the Compare register, an interrupt occurs.
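The Count/Compare behavior can be sketched as a toy Python model. The 32-bit register width and the names TimerRegisters and cycles_until_interrupt are assumptions made for illustration; only the half-rate increment and the equality test come from the text.

```python
class TimerRegisters:
    """Toy model of the Count/Compare pair (register width is assumed)."""

    WIDTH_MASK = 0xFFFF_FFFF  # assumption: 32-bit registers

    def __init__(self, compare):
        self.count = 0
        self.compare = compare
        self.interrupt_pending = False

    def pipeline_cycle(self, cycle_index):
        # Count advances at half the pipeline clock rate,
        # that is, on every second pipeline cycle.
        if cycle_index % 2 == 1:
            self.count = (self.count + 1) & self.WIDTH_MASK
            if self.count == self.compare:
                self.interrupt_pending = True


def cycles_until_interrupt(compare):
    """Pipeline cycles elapsed before the timer interrupt is raised."""
    timer = TimerRegisters(compare)
    cycle = 0
    while not timer.interrupt_pending:
        timer.pipeline_cycle(cycle)
        cycle += 1
    return cycle
```

In this model, a Compare value of N raises the interrupt after 2N pipeline cycles, which is how an operating system could turn a desired interval into a Compare setting.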

Memory hierarchy 

The R4000 fits a range of system configurations. A pro¬ 
grammable system interface permits tuning to different sys¬ 
tem specifications and exploiting future improvements in 
DRAM and SRAM design. The R4000 supports a two-level 
cache hierarchy that configures to run with different line sizes. 
Multiple cache coherency protocols available on the R4000 
support several multiprocessor systems. 9,10

The limited available primary cache size necessitated sup¬ 
port for a closely coupled off-chip secondary cache required 
by high-end systems. We estimated the cache control section 
required 10 percent extra logic to support systems both with 
and without secondary cache. The R4000 manages its primary 
and secondary caches using a write-back method, in which 
stores send data into the caches, but the data do not write 
back to memory until the cache line is replaced or flushed. 

The processor maintains its primary caches as a subset of the 
secondary cache contents. This prevents the occurrence of vir¬ 
tual aliases, which could lead to incorrect operation. A virtual 
alias occurs when multiple virtual addresses in the primary cache 
map to the same physical address in the secondary cache. 

The primary caches are virtually indexed, so the second¬ 
ary cache stores 3 bits of the virtual address (bits 14 to 12) 
needed to locate the primary cache lines that may contain 
data from a particular secondary cache line. (This virtual ad¬ 
dress information will support primary caches up to 32 Kbytes 
each). Because only one copy of the secondary cache line 
can reside in the primary cache, no two virtual addresses in 
the primary cache can map to the same physical location. 
Without this capability, R4000 would have to flush the large 
secondary cache to prevent aliasing. This is time consuming, 
especially for aliases caused by reusing pages for I/O. 
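The choice of virtual address bits 14 to 12 follows from simple arithmetic, sketched below in Python. The function name virtual_index_bits is hypothetical, and the sketch assumes a direct-mapped cache and the minimum 4-Kbyte page size.

```python
def virtual_index_bits(cache_bytes, page_bytes=4 << 10):
    """Address bit positions that index a direct-mapped cache but lie
    above the page offset, so they come from the virtual address."""
    cache_bits = cache_bytes.bit_length() - 1  # log2 for powers of two
    page_bits = page_bytes.bit_length() - 1
    return list(range(page_bits, max(cache_bits, page_bits)))


# A 32-Kbyte cache is indexed by bits 14..0; bits 11..0 are the page
# offset, which leaves bits 12, 13, and 14 untranslated.
largest = virtual_index_bits(32 << 10)   # [12, 13, 14]
initial = virtual_index_bits(8 << 10)    # [12]
```

Three stored bits are therefore enough for any primary cache up to 32 Kbytes, while the initial 8-Kbyte caches need only bit 12.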

Primary cache. While the initial version of R4000 uses an 
on-chip primary cache size of 8 Kbytes of instruction and 8 
Kbytes of data, we can easily increase these sizes. The cur¬ 
rent revision supports primary caches up to 32 Kbytes each 
of instruction and data. 

The primary cache is a direct-mapped, virtually indexed, 
physically tagged cache. Direct mapping makes it easy to find 
the location of a particular line in the cache and to manage 


cache consistency between the primary and secondary caches. 

As the primary cache is virtually indexed, the virtual ad¬ 
dress generated by R4000’s address unit looks up the cache 
line, while the address translation occurs in parallel. The ad¬ 
dress translation produces the physical address of the access, 
and the comparator compares it with the physical address 
read from the tag of the cache lines. The processor uses data 
coming out of the cache before it checks the tag, reducing 
the delay before load data can be used by one cycle. 

Direct-mapped caches access faster than associative caches, 
but their hit rate is not as high as for set-associative caches. 
This penalty decreases as we increase the size of the primary 
caches. The primary caches support two software-program¬ 
mable line sizes (16 and 32 bytes) that users can change 
independently for the instruction and data caches. 

R4000 needs two cycles to access data in the primary cache, 
but a new address may enter every cycle. This is possible 
because the processor accesses the cache array in one cycle, 
excluding the address buffering and the data drive time. The 
address does not access the array until the beginning of phase
2 of the first cycle, when the data from the previous access 
have been latched. 

The primary instruction and data caches have separate data 
and tag arrays. The data cache data array and tag array may be 
addressed separately every cycle. During the data first and 
data second stages of a store instruction, the processor ac¬ 
cesses the tag array for the store, while it may access the data 
array for a previous store that has passed its tag check and has 
data waiting in the store buffer. The two-entry store buffer 
decouples the data to be stored from the rest of the pipeline. 

Since the architecture supports byte stores, the data cache 
array is arranged in eight blocks. Each block has a byte of 
data, a parity bit, and a redundant bit. The primary caches 
access 64 bits of data at a time, with the ability to write se¬ 
lected bytes. Row and column redundancy terms improve 
the die yield. To replace a defective row or column in one of 
the cache arrays with a redundant row or column, the manu¬ 
facturer must blow the laser fuses. 

Secondary cache. The secondary cache is direct mapped, 
physically indexed, and physically tagged. Manufacturers can 
build it from industry-standard static RAMs of different speeds 
and densities. The 128-bit-wide secondary cache interface 
allows a single access to the secondary cache to fill a four- 
word primary cache line. This cache supports a line size of 
four, eight, 16, or 32 words. 

A physically indexed secondary cache makes multiprocessor 
support easy as all addresses on the system bus can be physical, 
eliminating the need for extra address translation information. 

With R4000 supporting a maximum secondary cache size 
of 4 Mbytes, and with several such caches present in a mul¬ 
tiprocessor system, the probability of a soft error demands 
support for error checking and correction. This ECC support 
for the secondary cache corrects 1-bit errors and detects 2-bit 
errors. R4000 performs on-chip tag correction, but it needs
external hardware support to correct data errors. 

We chose parity support for the primary cache since the 
on-chip caches are small and less prone to soft-error failure. 
If the operating system finds a parity error in the primary 
cache on a clean line, it can arrange to refill the primary 
cache line. When it detects a cache error, the processor takes 
an exception and jumps to uncached space. There the oper¬ 
ating system examines the cache error control register, which 
specifies the type and location of the cache error. 

One complex operation carried out in the cache logic is the 
write-back of dirty lines to memory. During writebacks, a state 
machine, the zipper, merges dirty (modified) lines in the primary
cache with the data from the secondary cache as the line
transfers to the system interface. The zipper checks tags in the 
primary instruction and data caches. It invalidates both instruc¬ 
tion and data lines while merging any dirty data from the pri¬ 
mary data cache. This operation completes in four pipeline cycles 
to match the maximum speed supported by the secondary cache. 

System interface. The system interface lets the processor 
access external resources required to satisfy cache misses. It 
also allows an external agent access to some of the processor’s 
internal resources. For multiprocessor systems, the system in¬ 
terface provides the processor mechanisms necessary to main¬ 
tain cache coherence of shared data. 

R4000 uses a 64-bit-wide system interface to increase main 
memory bandwidth compared with previous 32-bit system 
interfaces. The system interface can receive a double word 
every two pipeline cycles. If R4000 is operating without a 
secondary cache, the system interface can operate at the maxi¬ 
mum system interface data rate, since the primary cache has 
a 64-bit data path that supports this rate. With a secondary 
cache, the maximum data rate the processor can support 
directly relates to the secondary cache access time. If the 
access takes too long, the processor cannot transmit or ac¬ 
cept data at the maximum rate. The sec¬ 
ondary cache only accepts reads and 
writes occurring in at least four cycles. 

With fast static RAMs that support a four¬ 
cycle access, the secondary cache inter¬ 
face can keep up with data coming in 
from the system interface at the maxi¬ 
mum rate. Designers can program the 
system interface to transmit data in a 
range of rates, to suit different system 
and secondary cache speeds. 

The system interface can be pro¬ 
grammed to be clocked by a divided- 
down version (divided by two, three, or 
four) of the internal clock frequency. The 
internal clock runs at twice the 
processor’s input, or master, clock. This 
allows systems designed for slower versions of the R4000 to
run faster versions. For example, a system designed for a
50 MHz R4000 (with the system interface programmed to halve
the internal 100 MHz pipeline clock) could implement a 75 MHz
R4000 with the system interface clock divisor changed to
divide by three. A 75 MHz external clock generates a 150 MHz
internal pipeline clock, which the divisor divides by three
to produce a 50 MHz system clock.
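The clock arithmetic in this example can be checked with a few lines of Python. The function name system_clock_mhz is hypothetical; the doubling of the master clock and the allowed divisors of two, three, and four come from the text.

```python
def system_clock_mhz(master_mhz, divisor):
    """System interface clock for a given master clock and divisor."""
    if divisor not in (2, 3, 4):
        raise ValueError("the text describes divisors of two, three, or four")
    pipeline_mhz = 2 * master_mhz  # internal clock runs at twice the master clock
    return pipeline_mhz / divisor


# A 50 MHz master clock (100 MHz pipeline clock) with divide-by-two and
# a 75 MHz master clock (150 MHz pipeline clock) with divide-by-three
# both yield a 50 MHz system clock.
slower_part = system_clock_mhz(50, 2)
faster_part = system_clock_mhz(75, 3)
```

Because both settings yield the same system clock, the board-level timing stays fixed while the processor is upgraded.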

The R4000 supports an overlapped mode of operation on 
the system interface when configured with a secondary cache.
When a miss occurs in the secondary cache that requires a 
line to be written back to main memory, the system interface 
sends out a read request for the miss and then immediately 
sends out a write with the writeback data. This saves the 
R4000 from having to buffer up secondary cache lines before 
they are written back, which would use significant chip area 
to support the largest secondary cache line of 32 words. 

Multiprocessor support. The R4000 provides mechanisms
to implement a variety of cache coherency protocols that 
may be snoopy or directory based (see Figure 10). Designers 
closely coupled the multiprocessor logic with the pipeline 
activity to allow access to the primary caches. 

The starting point for R4000’s coherency model was the 
MESI (modified, exclusive, shared, invalid) protocol. MESI 
implements a four-state cache coherence protocol (the states 
are invalid, clean exclusive, dirty exclusive, and shared). R4000 
implements a fifth, the dirty shared state (Figure 11 on the 
next page), which allows for efficient implementations of a 
semaphore given the support for update protocol. When a 
processor successfully acquires a semaphore by gaining a 
dirty shared copy of the semaphore, all the other processors 
using that semaphore will be updated with its new value. 
They don’t need to generate additional transactions on the 
bus. With the MESI protocol, a request from another processor
(that is, an intervention) can cause writebacks to the system
memory. These writebacks place an additional burden
on the system design. (The R4000 cannot process three-party
transactions on interventions.) The processor stores the state
of a cache line along with the tag and data for each line in
the caches.

Figure 10. Multiprocessor protocols.

Figure 11. Cache coherency diagram.

When R4000 receives an external snoop, intervention, invali¬ 
date, or update, it checks the secondary cache tag and state bits 
while allowing the processor to operate within the primary cache 
space in parallel. Misses in the secondary cache require no further 
action because the primary is a subset of the secondary. If an 
external event hits in the secondary cache, access to the primary 
may be required to complete the transaction. To gain access to 
the primary cache, the processor stalls the CPU pipeline. 

The processor supports write-invalidate and write-update 
protocols, controlled on a per-page basis. The TLB may mark 
pages as uncached, noncoherent, coherent exclusive, coher¬ 
ent-write exclusive, and coherent-write update. Table 3 shows 
examples of the actions caused by these attributes. 


The R4000 provides a load linked and store conditional 
pair of instructions to provide synchronization between pro¬ 
cessors on the system bus based only on cache coherency. 
An example of this is the fetch-and-add operation. 


Loop:  ll    T0, 0(T1)      # load counter, set load link bit
       addu  T0, T0, 1      # increment
       sc    T0, 0(T1)      # store back if load link bit still set
       beq   T0, 0, Loop    # retry if store failed


Table 3. Examples of actions caused by coherency attributes.

Algorithm                 Load-miss               Store-miss
Uncached                  Word read               Word write
Noncoherent               Block read noncoherent  Block read noncoherent
Coherent exclusive        Block read exclusive    Block read exclusive
Coherent write exclusive  Block read              Block read exclusive
Coherent write update     Block read              Block read/update


The store conditional instruction fails if the location has been 
invalidated or updated since the preceding load linked in¬ 
struction. This mechanism can implement semaphores, bit- 
locks, fetch-and-add, and other synchronization mechanisms. 
It also guarantees that at least one processor on the bus will 
get the semaphore on the first attempt so deadlocks or long 
stalls will not occur. 
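The retry behavior of the load linked and store conditional pair can be mimicked in software. The Python sketch below is a behavioral model only: the class LinkedMemory, the function fetch_and_add, and the interfere hook are hypothetical names, and the model tracks one link per processor rather than any real cache state.

```python
class LinkedMemory:
    """Word memory with one load-link reservation per processor."""

    def __init__(self):
        self.words = {}
        self.links = {}  # processor id -> linked address

    def load_linked(self, cpu, addr):
        self.links[cpu] = addr
        return self.words.get(addr, 0)

    def store(self, addr, value):
        self.words[addr] = value
        # Any write to the location breaks every outstanding link on it.
        for cpu in [c for c, a in self.links.items() if a == addr]:
            del self.links[cpu]

    def store_conditional(self, cpu, addr, value):
        if self.links.get(cpu) != addr:
            return False  # link was broken: the store does not happen
        self.store(addr, value)
        return True


def fetch_and_add(mem, cpu, addr, amount=1, interfere=None):
    """Retry loop matching the ll/addu/sc/beq sequence; returns old value."""
    while True:
        old = mem.load_linked(cpu, addr)
        if interfere is not None:
            interfere()       # simulate another processor's write, once
            interfere = None
        if mem.store_conditional(cpu, addr, old + amount):
            return old
```

When another write lands between the load and the store, the first store conditional fails and the loop repeats with the fresh value, so no update is lost.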

Design methodology 

We chose full-custom data path layout for maximum speed 
and the highest packing density. Designers implemented most 
of the control sections using a logic synthesis and optimiza¬ 
tion tool and laid them out using standard cell place-and- 
route methodology. However, to achieve our target cycle 
times, we had to custom design and lay out by hand some of 
the control sections in the critical paths. 

We used a two-phase, zero-overlap clock strategy and dis¬ 
tributed it throughout the chip with a balanced clock tree, to 
control skew. A phase-locked loop generates four times the 
frequency of the external input (master) clock and distrib¬ 
utes it through the chip. Divide-by modules at the end of the 
clock tree generate 2x and 1x clocks. The processor pipeline
and most logic use the 2x clock, which cycles twice as
fast as the master clock frequency. The integer multiplier and 
floating-point multiplier use the high-speed 4x clock, four 
times the master clock frequency. 

The chip uses two types of register/latches: stacked and 
pass-gate dynamic. Stacked registers, used extensively, are 
immune to clock skew as long as there are zero or an even 
number of inversions between the two stacked latches. How¬ 
ever, when a short setup time and fast clock-to-output delay 
were necessary, we used pass-gate dynamic 
latches. In these cases, a design rule en¬ 
forced a delay equivalent to the time needed 
to pass through at least three inverters of a 
fan-out of three between the latches to pre¬ 
vent data slip-through. 

We equipped the output buffers with a 
digitally controlled slew rate to reduce noise 
injected into the system buses. One buffer 
determines the digital control signal values 
for the rest of the buffers. This output buffer 
sends the pad a signal, which in turn feeds 
back into an input pad. The processor samples the round-trip
delay and references it with the clock cycle. Users can
program the desired amount of skew in terms of a fraction of
a clock cycle. Depending on the control signals generated,
the strengths of the output buffer’s pull-ups and pull-downs are
adjusted. (See Figure 12.)

We laid out the chip using a generic Mips design rule, so 
all our semiconductor partners can work from a single database.
This database is based on a 1.0-μm-drawn, two-layer
metal, CMOS technology. Manufacturers are producing the
R4000 in 0.8-μm technology.

Verification 

Mips carried out a functional simulation of a register trans¬ 
fer level model during the development of the R4000. The 
RTL model executes at about 1,000 processor cycles per minute 
on a 20-MIPS, R3000-based Magnum workstation. Designers 
divided the chip into major functional blocks (CPU, FPU, 
MMU, caches, and system interface) and wrote directed diag¬ 
nostic tests to exercise these functional units. Trace compari¬ 
sons of diagnostic tests run on an instruction-level simulator 
and on the R4000 RTL verified compliance with our architec¬ 
ture. To trace all the required signals and data in the R4000 
superpipeline, we added more verification logic to the R4000 
RTL model so it could capture traces for comparison with the 
instruction-level simulator traces. 

We performed extensive automatically generated random di¬ 
agnostic tests, again using our instruction-level simulator for trace 
comparison. We wrote additional verification diagnostics to en¬ 
sure that all the arcs of the state machines within the R4000 were 
exercised. Our designers executed R4000 diagnostics within an 
RTL model of a system configurable at runtime to include a sec¬ 
ondary cache and change any of the programmable parameters 
that control the system interface. They booted the Unix operating 
system on the R4000 RTL model about 
six months before Mips gave the design 
to its manufacturing partners. It took a 
50-MIPS Mips 6280 seven days of pro¬ 
cessing to reach the Unix prompt. 

We verified the multiprocessor ca¬ 
pabilities of the R4000 using a number 
of different simulation models. A uni¬ 
processor RTL simulation of the R4000 
checked that the R4000 could gener¬ 
ate and process all the multiprocessor 
requests defined by the R4000 inter¬ 
face specification. We also developed 
a simulation environment that could 
support multiple R4000 processors at 
the RTL level. Under this environment 
we ran directed diagnostic tests and 
self-checking random tests. 


We verified that the implementation of the R4000 matched the RTL description
by generating a gate-level model from schematics. Obviously, 
this model ran much slower than the RTL model, and so we 
needed a large compute resource to run the diagnostic test 
suite at the gate level. In the final stages of verification we 
used ten 6280 machines and around thirty 20-MIPS Magnum 
workstations. 

Testability and packaging 

The R4000 implements JTAG (IEEE Std. 1149.1) boundary 
scan specifications, intended to provide a test capability for 
the interconnection between the R4000 processor, the printed 
circuit board, and other components on the board. 

The chip comes in two package configurations. The 
R4000MC and R4000SC, which have the 128-bit data inter¬ 
face to the secondary cache, are packaged in a 447-pin lead 
or plastic grid array. The R4000MC supports multiprocessor 
systems while the R4000SC supports high-performance uni¬ 
processor systems. The R4000PC, for desktop, low-end serv¬ 
ers, and embedded control systems, comes in a 179-pin PGA 
with no secondary cache interface. 


Table 4 lists Specmarks for simulated results of a
realistic memory system. We simulated the
CPU time and most of the important aspects of memory and
heuristically added the I/O times. Correlation of simulations
with R4000 systems in the lab shows the simulations to be
pessimistic.
Table 4. Simulated Specmarks for a 50-MHz external-clock R4000.

                      S-cache size          P-cache
Benchmark             4 Mbytes  512 Kbytes  only
Gcc                   46        43          27
Espresso              54        54          38
Spice2g6              42        38          27
Doduc                 49        46          33
Nasa7                 56        46          43
Li                    66        65          47
Eqntott               54        52          50
Matrix300             278       273         177
Fpppp                 55        54          29
Tomcatv               58        59          37
Simulated SPEC        63        59          42
Simulated SPEC int    55        53          39
Simulated SPEC fp     69        64          44
CPI (simulated SPEC)  1.5       1.6         2.3
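SPEC summary figures of this era are conventionally geometric means of the per-benchmark ratios. Assuming Table 4 follows that convention (the article does not say so), the Python sketch below recomputes the integer summary row from the four integer benchmarks. The name geometric_mean and the data layout are illustrative.

```python
import math


def geometric_mean(ratios):
    """Geometric mean, computed in log space."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Integer benchmarks from Table 4, in column order:
# (4-Mbyte S-cache, 512-Kbyte S-cache, P-cache only).
INTEGER_BENCHMARKS = {
    "Gcc": (46, 43, 27),
    "Espresso": (54, 54, 38),
    "Li": (66, 65, 47),
    "Eqntott": (54, 52, 50),
}

# zip(*...) regroups the per-benchmark rows into per-configuration columns.
int_summary = [
    round(geometric_mean(column))
    for column in zip(*INTEGER_BENCHMARKS.values())
]
```

Under that assumption the recomputed values agree with the "Simulated SPEC int" row after rounding.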
The Message-Driven Processor, an integrated multicomputer node, provides efficient mechanisms
for parallel computing. Rather than being specialized for a single model of computation,
the MDP incorporates primitive mechanisms for communication, synchronization, and nam¬ 
ing. These mechanisms efficiently support most proposed parallel programming models. Each 
processing node of MIT’s J-Machine consists of an MDP with 1 Mbyte of DRAM. MDPs have been 
operational since June 1991, and J-Machines built from them went on line in July 1991. 


The Message-Driven Processor is a 36-bit,
1.1-million-transistor VLSI microcomputer
specialized to operate efficiently in a
multicomputer. The
MDP chip includes a processor, a 4,096-word by
36-bit memory, and a network port. An on-chip 
memory controller with error checking and cor¬ 
rection (ECC) permits local memory to be ex¬ 
panded to one million words by adding external 
DRAM chips. 

The processor is message-driven in the sense 
that it processes in response to messages, via the 
dispatch mechanism. No receive instruction is 
needed. The MDP creates a task to handle each 
arriving message. Messages carrying these tasks 
advance, or drive, each computation. 

We designed the MDP with two primary goals 
in mind. 

• We wanted to implement a general-purpose, 
multicomputer processing node that provides 
the communication, synchronization, and 
naming mechanisms required to efficiently 
support several different parallel program¬ 
ming models. 

• We wanted to create an inexpensive, VLSI 
component for cost-efficient parallel computers.
Ideal nodes should be inexpensive
and plentiful VLSI commodity parts (as
inexpensive and plentiful as jellybean
candies) that can network together to form a
Jellybean Machine (J-Machine) multicomputer.

Efficient parallel mechanisms 

Computer hardware provides primitive opera¬ 
tions called mechanisms. These mechanisms build 
the abstractions that in turn make up a program¬ 
ming system. 1 For example, most sequential ma¬ 
chines provide some mechanism for a push-down 
stack to support the last-in-first-out (LIFO) stor¬ 
age allocation required by many sequential pro¬ 
gramming models. Most machines also provide 
some form of memory relocation and protection 
to allow several processes to coexist in memory 
at once without interference. The proper set of 
mechanisms can significantly improve perfor¬ 
mance over a brute-force interpretation of a pro¬ 
gramming model. 

Over the past 40 years, sequential von 
Neumann processors have evolved a set of mecha¬ 
nisms appropriate for supporting most sequen¬ 
tial programming models. It is clear, however, 
from efforts to build concurrent machines by wir¬ 
ing together many sequential processors, that 
these highly evolved sequential mechanisms do 
not adequately support most parallel models of 
computation. These mechanisms do not efficiently 
support synchronization of events, communication
of data, or global naming of objects. As a
result, designers must implement these functions, inherent to
any parallel model of computation, largely in software with
prohibitive overhead. For example, sequential machines require
hundreds of instructions to create a new process. This
cost prohibits the use of fine-grain programming models where
processes typically last only a few tens of instructions.

0272-1732/92/0400-0023$03.00 © 1992 IEEE

April 1992

The MDP supports a broad range of parallel programming 
models, including shared-memory, 2 data parallel, 3 dataflow, 4 
actors, 5 and explicit message-passing, 6 by providing low- 
overhead primitive mechanisms for communication, synchro¬ 
nization, and naming. Its communication mechanisms permit 
a user-level task on one node to send a message to any other 
node in a 4,096-node machine in less than 2 μs. This process
doesn't consume any processing resources on intermediate
nodes, and it automatically allocates buffer memory on the
receiving node. On message arrival, the receiving node creates
and dispatches a task in less than 1 μs.

Presence tags provide synchronization on all storage loca¬ 
tions. Three separate register sets allow fast task switching. A 
translation mechanism maintains bindings between arbitrary 
names and values, and supports a global virtual address space. 
We selected these mechanisms to be both general and ame¬ 
nable to efficient hardware implementation. To support fine- 
grain, concurrent programming systems, we designed the 
mechanisms to efficiently handle small objects (eight words) 
and small tasks (20 instructions). 

3D array of fine-grain, processing nodes 

The MDP is an example of an inexpensive, fine-grain, mul¬ 
ticomputer building block. A fine-grain node does not neces¬ 
sarily have a slow processor. We can build a competent 
processor in a fraction of a modern VLSI chip’s area. Fine
grain and small memory decrease the chip’s cost, resulting in 
greater arithmetic performance and local memory bandwidth 
per unit cost. Fast communication and a global address space 
prevent the small local memories from limiting programma¬ 
bility or performance. 

In a multicomputer, system cost is very sensitive to proces¬ 
sor cost. A less-expensive node results in a comparably priced 
system with more processors and, to first order, higher per¬ 
formance. In these systems, designers avoid costly features 
that give a small incremental return in processor performance 
(such as large caches) in favor of building systems with more 
nodes, an option not available to the designer of a sequential 
computer. 

The 3D network that connects MDPs gives the highest 
throughput and lowest latency for a given wire density. 7 This 
network allows the processing nodes to be packed densely 
and results in uniformly short wires. It does not waste com¬ 
munication bandwidth by embedding an esoteric topology 
into physical space. Messages traveling through the network 
follow a Manhattan shortest path in physical space; they never 
backtrack. (A Manhattan path travels forward, to the side,
and up or down, but not across diagonals.)
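A Manhattan shortest path in a 3D mesh can be sketched in Python as follows. The text states only that messages follow a shortest Manhattan path without backtracking; the X-then-Y-then-Z ordering used here is an assumption for illustration, and the function names are hypothetical.

```python
def manhattan_hops(src, dst):
    """Number of hops on any Manhattan shortest path between two nodes."""
    return sum(abs(d - s) for s, d in zip(src, dst))


def dimension_order_route(src, dst):
    """One shortest path: finish X, then Y, then Z (assumed ordering)."""
    path = [tuple(src)]
    pos = list(src)
    for axis in range(3):
        step = 1 if dst[axis] > pos[axis] else -1
        while pos[axis] != dst[axis]:
            pos[axis] += step  # always moves toward dst, never backtracks
            path.append(tuple(pos))
    return path


route = dimension_order_route((0, 0, 0), (3, 1, 2))
```

Every step changes exactly one coordinate by one, so the route length always equals the Manhattan distance between the two nodes.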

Background 

The MDP builds on previous work in multicomputer de¬ 
sign. Like the Caltech Cosmic Cube, 6 Intel’s iPSC, 8 the Ncube, 9 
and the Ametek, 10 each MDP in the J-Machine has a local
memory and communicates with other nodes by passing 
messages. Because of its low overhead, the MDP can exploit 
concurrency at a much finer grain than these early message¬ 
passing multicomputers. Delivering a message and dispatch¬ 
ing a task in response to the message’s arrival takes less than 
2 μs on the J-Machine, as opposed to 5 ms on an iPSC-1 or
300 μs on an iPSC-2.

Like the BBN Butterfly 11 and the IBM RP3, 12 the MDP sup¬ 
ports a global virtual address space. The same IDs (virtual 
addresses) reference local (on the same node) and remote 
(on a different node) objects. Like the Inmos transputer, 13 the 
Caltech Mosaic, 14 and the Intel iWarp, 15 the MDP is a single¬ 
chip processing element integrating a processor, memory, 
and a communication unit. The MDP is unique because it 
extends these previous efforts with efficient primitive mecha¬ 
nisms for communication, synchronization, and naming. 1 It 
uses a direct communication network based on work reported 
by Dally, 7 Dally and Seitz, 16 and Dally and Song. 17 

System architecture 

To the hardware designer, the MDP appears as a compo¬ 
nent with a memory port, six two-way network ports, and a 
diagnostic port, as shown in Figure 1. 

The memory port provides a direct (that is, no glue) inter¬ 
face to up to 1 Mwords of ECC DRAM, consisting of 11 mul¬ 
tiplexed address lines, a 12-bit data bus, and three control 
signals. Static-column or page mode DRAMs cycle three times 
to access a 36-bit data word and a fourth time to check or 
update the ECC check bits. Current J-Machines use three 1M
x 4 memory parts to form a four-chip processing node with
262,144 words of memory that measures 2 in. x 2.75 in., as
shown in Figure 2.

Figure 1. MDP pinout. The MDP has a memory port (26
pins), six network ports (15 pins each), and a diagnostic
port (three pins).

Figure 2. An array of four J-Machine processing nodes. Each node consists of one
MDP chip and three 1M x 4 static-column DRAMs. With conventional packaging the
node measures 2 in. x 2.75 in.
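The 262,144-word figure can be reproduced from the memory-port description with a short Python calculation. The function name node_memory_words is hypothetical, and the sketch assumes that the ECC check bits for each word occupy one additional 12-bit location. That layout is inferred from the fourth access cycle and is not stated outright in the text.

```python
def node_memory_words(chips=3, bits_per_chip=4, locations=1 << 20,
                      word_bits=36, ecc_accesses=1):
    """Words of ECC-protected memory provided by one node's DRAM."""
    bus_bits = chips * bits_per_chip        # three x4 parts form a 12-bit bus
    data_accesses = word_bits // bus_bits   # three accesses per 36-bit word
    accesses_per_word = data_accesses + ecc_accesses
    return locations // accesses_per_word


words = node_memory_words()  # 1,048,576 locations / 4 accesses per word
```

The same arithmetic shows why each word costs four DRAM cycles: three to move the 36 data bits over the 12-bit bus and one for the check bits.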

The network ports connect MDPs together in a 3D mesh
network. Each of the six network ports corresponds to one
of the six cardinal directions (+X, -X, +Y, -Y, +Z, -Z) and
consists of nine data and six control lines. Each port connects
directly to the opposite port on an adjacent MDP. We give
details of the 3D network later in this article.

The diagnostic port issues supervisory commands and reads 
and writes MDP memory from a console processor. The port 
consists of two control lines, a serial input line, and a serial 
output line. Using this port, a console processor can read or 
write any location in the MDP’s address space, as well as 
reset, interrupt, halt, or single-step the processor. 

Software. To a systems programmer, a bare J-Machine 
appears as a collection of node memories and register files 
operable by an instruction set that includes communication, 
synchronization, and naming mechanisms. The systems pro¬ 
grammer uses these mechanisms to implement a program¬ 
ming model. For example, one can build a shared memory 
model that gives the application programmer a single, shared 
address space. 

The implementation of a combining tree 18 illustrates the 
use of the MDP mechanisms. The combining tree (Figure 3) 
consists of a number of nodes each containing a value, a 
count, and a pointer to a parent node. 


We initialize the value to zero and 
the count to the number of inputs 
expected. To sum the values of a 
number of parallel processes, each 
node sends a COMBINE message 
containing the result of its process 
to a combining node. When the 
messages arrive, the processor con¬ 
taining the combining node creates 
a task to execute the COMBINE rou¬ 
tine. The routine adds the message 
value to the node’s value and dec¬ 
rements the count. When the count 
reaches zero, the node sends a 
COMBINE message to the node’s 
parent. 
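The combining step described above can be sketched in Python. This is only an illustration under stated assumptions: the `CombiningNode` class, the `run` function, and the use of a software queue in place of the network are hypothetical names and simplifications, not MDP code.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class CombiningNode:
    count: int                                # inputs still expected
    parent: Optional["CombiningNode"] = None  # None marks the root
    value: int = 0                            # running sum, starts at zero


def run(messages):
    """Process COMBINE messages, each a (node, value) pair, in arrival order."""
    queue = deque(messages)      # stands in for the network plus message queue
    results = []
    while queue:
        node, value = queue.popleft()
        node.value += value      # add the message value to the node's value
        node.count -= 1          # one fewer input outstanding
        if node.count == 0:
            if node.parent is None:
                results.append(node.value)               # root: sum complete
            else:
                queue.append((node.parent, node.value))  # COMBINE to parent
    return results


# Four processes report to two combining nodes that share one root.
root = CombiningNode(count=2)
left = CombiningNode(count=2, parent=root)
right = CombiningNode(count=2, parent=root)
total = run([(left, 1), (right, 2), (left, 3), (right, 4)])
print(total)
```

Each node forwards exactly one message to its parent, so the root sees one input per child regardless of how many processes report below it.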

Communication. The MDP sup¬ 
ports communication using a SEND 
instruction for message formatting, 
a fast network for delivery, auto¬ 
matic message buffering, and task 
creation upon message arrival. 

A series of SEND instructions car¬ 
ries a message of arbitrary length 
to any node in the machine. Upon 
arrival at the receiving node, a hard¬ 
ware queue buffers the message. 
When the message reaches the head of the queue, the node 
dispatches a task to handle the message. The combining tree 
example uses a pair of SEND instructions to send the COM¬ 
BINE message to a node. Upon message arrival, the MDP 
buffers the message and creates a task to execute the COM¬ 
BINE routine. 

Synchronization. The MDP synchronizes using message
dispatch and presence tags on all states. Because each message
arrival dispatches a process, messages can signal events
on remote nodes. For example, in the combining tree example,
each COMBINE message signals its own arrival and
initiates the COMBINE routine.

Figure 3. A combining tree sums results produced by a distributed
computation. Each node sums the input values as
they arrive and then passes a result message to its parent.

In response to an arriving message, the processor may set 
presence tags for task synchronization. For example, access 
to the value produced by the combining tree may be syn¬ 
chronized by initially tagging as empty the location that will 
hold this value. An attempt to read this location before the 
combining tree had written it would raise an exception and 
suspend the reading task until the root of the tree writes the 
value. Synchronization on data availability in this manner is 
quite common in many parallel programs. 

Naming. The MDP supports naming with segmented 
memory management and translation instructions. In the com¬ 
bining tree example, the MDP allocates a memory segment 
to hold the state of each combining node. Using a segment 
descriptor, it relocates and protects accesses to the node. To 
make combining nodes relocatable across processing nodes, 
the MDP translates a node’s virtual address to find the pro¬ 
cessing node where it resides. Upon reaching this node, a 
second translation locates the segment descriptor for the com¬ 
bining node. 
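The two-step translation can be sketched in Python as follows. All names here (`ProcessingNode`, `enter`, `lookup`, `locate`, the string object ID, and the example coordinates) are hypothetical; the per-node dictionary merely stands in for the MDP's translation table.

```python
class ProcessingNode:
    def __init__(self, coords):
        self.coords = coords
        self.table = {}              # stands in for the translation table

    def enter(self, key, data):
        self.table[key] = data       # record a key -> data translation

    def lookup(self, key):
        return self.table.get(key)   # None models a translation miss


def locate(object_id, sender, nodes):
    """Resolve an object ID to (home node coordinates, (base, length))."""
    home = sender.lookup(object_id)              # first: ID -> processing node
    if home is None:
        raise KeyError(f"no node translation for {object_id!r}")
    descriptor = nodes[home].lookup(object_id)   # second: ID -> segment
    if descriptor is None:
        raise KeyError(f"{object_id!r} is not resident on node {home}")
    return home, descriptor


a = ProcessingNode((0, 0, 0))
b = ProcessingNode((1, 0, 0))
nodes = {a.coords: a, b.coords: b}

a.enter("comb7", b.coords)     # node a believes comb7 lives on node b
b.enter("comb7", (0x40, 3))    # node b holds its (base, length) descriptor
found = locate("comb7", a, nodes)
print(found)
```

Relocating the combining node would then amount to updating both entries; callers that hold only the object ID are unaffected.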

Instruction set architecture 

The MDP extends a conventional microprocessor instruc¬ 
tion set architecture (ISA) with instructions to support paral¬ 
lel processing. Specifically, the MDP provides efficient 
hardware mechanisms for communication, synchronization, 
and naming. Although we describe here the MDP ISA, with 
particular emphasis on these mechanisms, readers can find 
more details in Dally et al. 19 

Register set. The MDP provides separate register sets to
support rapid switching between three execution levels:
background, priority 0 (P0), and priority 1 (P1). The MDP
executes at the background level when no messages are pending.
Each arriving message creates a task and initiates execution
at P0 or P1, depending on the message’s priority. The MDP
executes the highest priority task at any point in time. The
arrival of a P1 message while the MDP is executing a P0 task
causes the MDP to switch execution levels (and thus register
sets). When the P1 task completes, the MDP resumes execution
at P0 by switching to the P0 register set that holds the
register state of the suspended task.

The register set at each priority level includes 

• four general-purpose data registers, R0-R3, 

• four address registers, A0-A3,

• four ID registers, ID0-ID3, and 

• one instruction pointer, IP. 

The background register set does not include ID registers.
They exist only at P0 and P1.

Most instructions operate on the general registers R0-R3. 
Each address register A0-A3 contains a segment descriptor 


consisting of a base and a length field. Memory addresses are
specified by an offset and an address register. For example,
the operands [R0, A1] and [3, A2] specify an indexed access
to the segment described by A1 and a displacement of three
words into A2’s segment.
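A minimal Python sketch of this segment-relative addressing follows. The `SegmentDescriptor` and `SegmentFault` names, the example base and length values, and the choice to model the protection exception as a Python exception are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentDescriptor:
    base: int     # address of the segment's first word
    length: int   # number of words in the segment


class SegmentFault(Exception):
    """Models the exception raised for an out-of-bounds access."""


def effective_address(offset, descriptor):
    """Relocate an offset by the segment base after checking the length."""
    if not 0 <= offset < descriptor.length:
        raise SegmentFault(f"offset {offset} outside segment of "
                           f"{descriptor.length} words")
    return descriptor.base + offset


A1 = SegmentDescriptor(base=0x100, length=16)
A2 = SegmentDescriptor(base=0x200, length=8)
R0 = 5

indexed = effective_address(R0, A1)    # operand [R0, A1]
displaced = effective_address(3, A2)   # operand [3, A2]
print(hex(indexed), hex(displaced))
```

Because every access goes through a descriptor, moving an object only requires rewriting its base field.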

ID registers usually hold object IDs. The instruction pointer 
includes process status bits that control virtual addressing, 
type checking, and fault handling. Placing these bits in the 
instruction pointer enables control and execution states to 
change by loading a single register. The relatively small size 
of each register set facilitates quick task switching within an 
execution level. 

Tags. The MDP uses tags for type checking and synchro¬ 
nization. Every 36-bit word of register and memory state holds 
a 32-bit value and a 4-bit tag that indicates the type of the 
value. Tag values are defined for primitive user data types 
(such as symbol, integer, and Boolean) and for system data 
types, such as IP, Addr (a segment descriptor), and Msg (a 
message header). Four tag values are user-definable. If type 
checking is enabled, the MDP checks operand tags to deter¬ 
mine which form of an instruction to execute. It raises an 
exception if the operands are incompatible with the instruction. 

Two tags, Fut and Cfut, support intertask synchronization. 
A Cfut tag initially marks a location empty. When a task pro¬ 
duces the value for the location, it overwrites the Cfut with 
the final value and tag. Any attempt to read from the location 
before the value is produced invokes the Cfut fault handler, 
which typically suspends the reading task until the location 
is written. Fut is used for global synchronization, and Cfut for 
local. 

Hardware support for tags makes software more efficient 
and robust. A program can perform an operation without 
checking whether operands are present or of the correct type. 
For normal cases in which no fault occurs, execution pro¬ 
ceeds faster than if special test and branch instructions were 
required to check for type and presence. Only exceptional 
cases incur the overhead of running a fault handler. 
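The Cfut behavior can be sketched in Python. This is a hedged illustration: the `Cell` class, the `waiting` list, and the policy of resuming suspended readers at write time are assumptions about what a fault handler might do, not a description of the MDP's actual handler.

```python
CFUT = "Cfut"   # tag marking a location whose value is not yet present
INT = "Int"     # tag for an ordinary integer value


class CfutFault(Exception):
    """Models the fault raised when a Cfut-tagged word is read."""


class Cell:
    def __init__(self):
        self.tag = CFUT
        self.value = None
        self.waiting = []   # tasks suspended on this cell (assumed policy)


def read(cell):
    if cell.tag == CFUT:    # the check hardware performs on every operand
        raise CfutFault()
    return cell.value


def write(cell, value, tag=INT):
    """Overwrite the Cfut with the final value and tag; return waiters."""
    cell.value, cell.tag = value, tag
    resumed, cell.waiting = cell.waiting, []
    return resumed


def try_read(cell, task, log):
    try:
        log.append((task, read(cell)))   # normal case: no extra test needed
    except CfutFault:
        cell.waiting.append(task)        # fault handler suspends the reader


log = []
result = Cell()
try_read(result, "consumer", log)        # too early: consumer is suspended
for task in write(result, 10):           # producer supplies the value
    try_read(result, task, log)          # resumed reader now succeeds
print(log)
```

Only the early read pays for the handler; a read that arrives after the write proceeds with no presence test in the instruction stream.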

Instructions. The MDP executes 17-bit, fixed-format, three-address
instructions with the format shown in Figure 4. Each
instruction specifies an operation, two general register operands,
and a third operand that may be a register, a memory
location, or a constant. Two 17-bit instructions fit into each
36-bit word. Any instruction stream word not tagged as an
instruction is loaded as a constant into register R0. This provides
a very efficient means to load arbitrary 36-bit constants.
Figure 5 summarizes the MDP instruction set by category.

  Bits 16-11   Opcode
  Bits 10-9    Operand 2 (register operand)
  Bits 8-7     Operand 1 (register operand)
  Bits 6-0     Operand 0 (register, constant, or memory operand)

Figure 4. MDP instruction format.

  General movement and type instructions
    READ, WRITE, READR, WRITER, RTAG, WTAG, LDIP, LDIPR, CHECK

  Arithmetic and logic instructions
    CARRY, ADD, SUB, MULH, MUL, ASH, LSH, ROT, AND, OR, XOR,
    FFB, NOT, NEG, LT, LE, GE, GT, EQUAL, NEQUAL, EQ, NEQ

  Network instructions
    SEND, SENDE, SEND2, SEND2E

  Associative lookup table instructions
    XLATE, ENTER, PROBE

  Special instructions
    NOP, INVAL, SUSPEND, CALL

  Branches
    BR, BNIL, BNNIL, BF, BT, BZ, BNZ

Figure 5. Six categories of MDP instructions.

Naming. The MDP supports naming via translation instructions
and segmented addressing. Addressing memory through
segment descriptors permits arbitrary size objects to be relocated
and protected. The ENTER instruction enters an arbitrary
translation from a 36-bit key to a 36-bit data value in a
set-associative cache (translation table) mapped into the on-chip
memory. The XLATE instruction looks up the data value (if
any) associated with a key. These instructions can translate an
object’s name into a physical segment descriptor or a node
number to support a global virtual address space.

Communication. The MDP provides hardware support
for end-to-end message delivery including formatting, injection,
delivery, buffer allocation, buffering, and task scheduling.

An MDP transmits a message using a series of SEND instructions,
each of which injects one or two words into the
network at either priority 0 or 1. Figure 6 shows a typical
message send. The first SEND instruction reads the absolute
address of the destination node in <X,Y,Z> format from R0
and forwards it to the network hardware. The SEND2 instruction
reads the first two words of the message out of
registers R1 and R2 and enqueues them for transmission. The
final instruction enqueues two additional words of data, one
from R3, and one from memory. The use of the SEND2E
instruction marks the end of the message and causes it to be
transmitted into the network. This sequence executes in four
clock cycles (250 ns).

  SEND   R0,0          ; send net address (priority 0)
  SEND2  R1,R2,0       ; header and receiver (priority 0)
  SEND2E R3,[3,A3],0   ; selector and continuation -
                       ; end msg. (priority 0)

Figure 6. MDP assembly code to send a four-word message
uses three variants of the SEND instruction.

The network delivers an injected message to the destination
node, as described later. At the destination, a hardware-managed,
FIFO queue in the internal RAM of the MDP buffers
the message. Separate queues exist for P0 and P1 messages.

Task scheduling. When a message reaches the head of
the highest priority nonempty queue, the MDP creates a task
to handle it by changing the thread of control and creating a
new addressing environment, as shown in Figure 7. Every
message header contains a message opcode and the message
length. The MDP loads the message opcode into the
instruction pointer to start a new thread of control. The length
field and the queue head create a message segment descriptor
(automatically written to A3) that represents the initial
addressing environment for the task. The message handler
code may open additional segments by translating object IDs
in the message into segment descriptors. Creating a task to
handle a message takes three cycles.

The dispatch mechanism directly processes messages re¬ 
quiring low latency (for example, combining and forwarding). 
Other messages, such as a remote procedure call, specify a 
handler that locates the required method (using the translation 
mechanism described earlier) and then transfers control to the 
method. 
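The scheduling rule, in which the head of the highest priority nonempty queue is dispatched and the message header selects the handler, can be sketched in Python. The `MessageNode` class, the handler dictionary, and the string opcodes are hypothetical stand-ins; in the MDP the header supplies an instruction pointer value and the message is dequeued by SUSPEND rather than at dispatch.

```python
from collections import deque


class MessageNode:
    def __init__(self, handlers):
        self.queues = {0: deque(), 1: deque()}   # separate P0 and P1 queues
        self.handlers = handlers                 # opcode -> handler routine

    def receive(self, priority, message):
        self.queues[priority].append(message)    # buffer on arrival

    def dispatch(self):
        """Run the handler for the head of the highest priority queue."""
        for priority in (1, 0):                  # P1 is served before P0
            queue = self.queues[priority]
            if queue:
                header, *words = queue.popleft() # simplified: dequeue here
                return self.handlers[header](words)
        return None                              # nothing pending: background


log = []
node = MessageNode({
    "COMBINE": lambda words: log.append(("COMBINE", words)),
    "CALL": lambda words: log.append(("CALL", words)),
})
node.receive(0, ["CALL", "method_id", 7])
node.receive(1, ["COMBINE", "node_ptr", 3])
while any(node.queues.values()):
    node.dispatch()
print(log)
```

Although the CALL message arrived first, the COMBINE message at priority 1 is handled first.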



Figure 7. Message dispatch. In three clock cycles, a node
creates a new task by setting the instruction pointer to
change the thread of control and creating a message segment
to provide the initial addressing environment.

  MOVE  [1,A3],R0    ; get method ID
  XLATE R0,A0        ; translate to segment descriptor
  LDIP  INITIAL_IP   ; load instruction pointer to
                     ; transfer control to method

Figure 8. MDP assembly code for the CALL message.

For example, Figure 8 shows the CALL handler code handling
a remote procedure call. Figure 9 depicts the execution
of the handler. The first instruction gets the method ID (offset
one word into the message segment referenced by A3).
The next instruction translates this method ID into a segment
descriptor for the method and places this descriptor in A0.
In one of its operating modes, the MDP can use A0 as a
pointer to a segment of code and IP as an index into that
segment. This allows code to be easily relocated at runtime.

The final instruction of the CALL han¬ 
dler transfers control to the method by 
loading the IP with a short integer off¬ 
set. Thereafter the MDP will fetch in¬ 
structions from the called method. 

The method code may then read in 
arguments from the message queue. The 
XLATE instruction translates argument 
object identifiers to physical memory 
base/length pairs. If the method needs 
space to store local state, it may create 
a context object. When the method fin¬ 
ishes executing, or when it needs to wait 
for a reply, it executes a SUSPEND in¬ 
struction, which dequeues its message 
and passes control to the next message 
in the queue. 

An example of a direct message han¬ 
dler is the COMBINE routine shown in 
Figure 3. Figure 10 displays the code 
for this routine. If the node is idle, ex¬ 
ecution of this routine begins three 
cycles after message arrival. The rou¬ 
tine loads the combining node pointer 
and value from the message, performs 
the required add and decrement, and, 
if Count reaches zero, sends a message 
to its parent. 

This 12-instruction routine executes in 21 cycles. It
demonstrates several ways in which the MDP’s communication
mechanism reduces the overhead of message passing to the
point where it can perform simple operations, such as
combining. These ways include the following:

• The MDP hardware dispatches the COMBINE task by
setting the instruction pointer to COMBINE and initializing
message pointer A3 to allow direct access to message
words. This avoids the overhead otherwise
associated with control transfer and with setting up an
addressing environment.

• The two SEND instructions transmit the four-word message
to the parent task. The message transmits directly
from register and memory variables with no need to first
format it in memory.

• The SUSPEND instruction terminates the task and simultaneously
dequeues the message. If another message is
pending in the queue, the processor dispatches a task to
handle it two cycles after the execution of the SUSPEND
instruction.

Figure 9. The CALL message invokes a method by translating the
method identifier to find the code, creating a context (if
necessary) to hold local state, and translating argument
identifiers to locate arguments.

  COMBINE: MOVE   [1,A3], COMB           ; get node pointer from msg
           MOVE   [2,A3], R1             ; get value from msg
           ADD    R1, COMB.VALUE, R1
           MOVE   R1, COMB.VALUE         ; store result
           MOVE   COMB.COUNT, R2         ; get Count
           ADD    R2, -1, R2
           MOVE   R2, COMB.COUNT         ; store decremented Count
           BNZ    R2, DONE
           MOVE   HEADER,R0              ; get message header
           SEND2  COMB.PARENT_NODE, R0   ; send message to parent
           SEND2E COMB.PARENT, R1        ; with value
  DONE:    SUSPEND

Figure 10. MDP assembly code for the combining tree example.



Figure 11. The J-Machine network is a 3D mesh or k-ary
3-cube. The network performs e-cube or destination tag
routing. Messages route in each dimension in turn to the
proper coordinate in that dimension. In this figure, a message
routes from (1,5,2) to (5,1,4), routing first in X, then
Y, then Z.


Network architecture 

The MDP contains a network interface and a router that
support a communication network closely integrated with
the processor. In a J-Machine composed of MDPs, the network
provides end-to-end message delivery with low latency
(less than 2 µs in a 4,096-node network) and high bandwidth
(288 Mbits per second per channel). Message delivery occurs
entirely within the routers of the machine and consumes no
processor or memory resources at intermediate nodes.

Structure. The J-Machine network is a 3D grid, with two- 
way channels, dimension-order routing, and blocking flow 
control. (See Figure 11.) Addressing limits the size of the 
network to 65,536 nodes (32 x 32 x 64). Our initial prototype 
is a 1,024-node machine (8 x 8 x 16). The faces of the net¬ 
work cube are open for use as I/O ports to the machine. 
Each channel can sustain a data rate of 288 Mbps. All three 
dimensions may operate simultaneously for an aggregate data 
rate of 864 Mbps per node. 

Three modules, shown in Figure 12, compose the network 
logic. The network output module buffers words and injects 
them into the network. The three routers, one for each di¬ 
mension of the network, route messages from node to node. 
The network input module reassembles messages at their 
destination and buffers them into a message queue. We de¬ 
scribe more details of implementation in the next section. 

Engineering. We chose the 3D mesh topology of the J- 
Machine network as the most efficient arrangement subject 
to constraints of wiring density and component pinout. 7 These 
constraints set the width of the six bidirectional channels per 
MDP node at 9 data bits plus 6 control bits. We built the J- 
Machine as a stack of boards with dense board-to-board in¬ 
terconnections to implement the 3D network with short wires. 

The MDP breaks with the tradition of asynchronous network
routers by implementing a synchronous router. 16,17 This
router operates at twice the rate of the processor, sending a
pair of 9-bit phits between nodes each 62.5-ns processor
cycle. (A phit is a physical digit, the width of the physical
channel. A pair of phits form a flit, or flow-control digit, the
granularity of flow control in the network. An 18-bit flit is
half an MDP data word.)

Each of the six bidirectional channels 
can be turned around on alternate cycles 
with no contention penalty. A novel pad 
design tolerates clock skew between 
routers and eliminates the potential for 
conduction overlap when the channel 
reverses direction. 20 Messages route 
through the network with a latency of 
one 62.5-ns processor cycle per hop. 
Thus, message latency T is given by

    T = T_c (2L + D),

where T_c is the processor cycle time, L is message length in
words, and D is the distance (number of nodes) a message
must traverse. For example, in a 1,024-node machine, an L = 6
word message to a random destination traverses an average
of D = 10 nodes for a latency of T = 22 cycles or 1.4 µs. The
bisection bandwidth (the bandwidth across a plane dividing
the machine into two equal halves) of a 1,024-node machine
is 18.4 Gbps. The aggregate bandwidth of the network channels
is 864 Gbps, and the I/O bandwidth is 184 Gbps.

Figure 12. MDP block diagram.
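The arithmetic behind these figures can be checked with a short Python sketch. The function names are hypothetical, and the bisection calculation assumes that the dividing plane of the 8 x 8 x 16 prototype cuts 8 x 8 = 64 channels, which is one reading that reproduces the quoted 18.4 Gbps.

```python
T_C_NS = 62.5         # processor cycle time in nanoseconds
PHIT_BITS = 9         # width of the physical channel
PHITS_PER_CYCLE = 2   # router runs at twice the processor rate


def latency_cycles(length_words, distance_nodes):
    """T / T_c = 2L + D: two flits per word plus one cycle per hop."""
    return 2 * length_words + distance_nodes


def channel_rate_mbps():
    """Bits moved per processor cycle, converted to Mbits per second."""
    bits_per_cycle = PHIT_BITS * PHITS_PER_CYCLE
    return bits_per_cycle * 1000 / T_C_NS


cycles = latency_cycles(length_words=6, distance_nodes=10)
latency_us = cycles * T_C_NS / 1000
rate = channel_rate_mbps()
bisection_gbps = 8 * 8 * rate / 1000   # assumed: 64 channels cross the plane

print(cycles, latency_us, rate, bisection_gbps)
```

The computed latency of 1.375 µs rounds to the 1.4 µs stated in the text.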

Routing and flow control. The J-Machine uses deter¬ 
ministic dimension order routing, also called e-cube routing. 
As shown in Figure 11, all messages route first in the X di¬ 
mension, then in Y, then in Z. Since messages route in di¬ 
mension order and messages running in opposite directions 
along the same dimension do not block, we avoid resource 
cycles, and leave the network provably deadlock free. 21 

Table 1 lists the format of a message. The first three flits of 
the message contain the X, Y, and Z addresses. Each node 
along the path compares the address in the head flit of the 
message with the node’s index in the current dimension. If 
the two indices match, the node strips the head flit off the 
message and routes the rest to the next dimension. The MDP’s 
network output node formats the address flits of the mes¬ 
sage. It also precomputes the direction (positive or negative) 
the message must travel along each dimension, setting addi¬ 
tional bits in the address flits. This reduces the latency and 
complexity of the router nodes. 
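Dimension-order routing and the precomputed address flits can be sketched in Python. The function names and the textual "coordinate:sign" rendering of an address flit are assumptions chosen to mirror Figure 11 and Table 1; the actual bit-level flit encoding is not given here.

```python
def ecube_route(src, dst):
    """Return the nodes visited when routing in X, then Y, then Z."""
    path = [tuple(src)]
    pos = list(src)
    for dim in range(3):
        step = 1 if dst[dim] > pos[dim] else -1
        while pos[dim] != dst[dim]:   # finish this dimension before the next
            pos[dim] += step
            path.append(tuple(pos))
    return path


def address_flits(src, dst):
    """One flit per dimension: target coordinate plus travel direction."""
    return [f"{d}:{'+' if d >= s else '-'}" for s, d in zip(src, dst)]


path = ecube_route((1, 5, 2), (5, 1, 4))
hops = len(path) - 1
flits = address_flits((1, 5, 2), (5, 1, 4))
print(hops, flits)
```

For the route in Figure 11 this gives ten hops and the address flits 5:+, 1:-, and 4:+, matching the first three rows of Table 1.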

The network uses blocking flow control to resolve contention
for a physical channel (see Figure 13). When a message
arrives at a router path already in use by a message of the
same priority, it is blocked. The blocked message compresses
into routers along its path, occupying one node per word
(two flits) of the message. When the blockage clears, the
message uncompresses and proceeds to its destination, at a
rate of one hop per cycle.

Table 1. A typical message in the J-Machine.

  Flit   Contents     Remark
  1      5:+          X address
  2      1:-          Y address
  3      4:+          Z address
  4      MSG: 00      Method to call
  5      00440
  6      INT: 00      Argument to method
  7      0023
  8      INT: 00      Reply address
  9      <1:5:2> T

  The first three flits contain the destination address. The
  final flit in the message is marked as the tail (T).

Figure 13. The J-Machine network performs blocking flow
control with two stages of queueing per node. Message
arrives at busy channel (a). Message becomes compressed
by queueing (b). Channel is available; message continues
advancing (c).

Two priorities of messages share the physical wires, but
use completely separate buffers and routing logic. This allows
priority 1 messages to proceed through blockages at
priority 0. Without this ability, the system could not redistribute
data that has caused hot spots in the network.

MDP implementation 

Figure 12 shows the major subsystems in the MDP. The 
chip includes a conventional microprocessor with prefetch, 
control, register file and ALU (RALU), and memory blocks. 
The communication system comprises the routers and net¬ 
work input and output interfaces. The address arithmetic unit 
(AAU) provides addressing functions. The MDP also includes 
a DRAM interface, control block, and diagnostic interface. 

Communication subsystem. The communication sub¬ 
system contains the network output, the network input, and 
the routers. The network output block buffers messages from 
the registers or memory and injects them into the network. A 
FIFO buffer matches the speed of message transmission to 
the network. On each SEND instruction, the MDP transfers 
one or two words to its FIFO. When the message is com¬ 
plete, or the eight-word buffer is full, the buffer launches the 
message into the network. In cases where the MDP cannot 
send message words as fast as the network can transmit them, 
the FIFO prevents bubbles (absence of words) from entering 
the network pipeline and degrading performance. 

The network input module transfers messages from the
network to the MDP’s memory. Data from the network arrive
in 18-bit flits, which are composed into a four-word queue
row buffer (QRB). When the QRB fills, it writes its contents to
the on-chip memory in one cycle. Writing memory a row
(4 x 36 bits) at a time reduces the number of memory cycles
consumed by the network, leaving more memory bandwidth for
the CPU.

The routers form the switches in a J-Machine network and
deliver messages to their destinations. As shown in Figure
14a, the MDP contains three independent routers, one for
each bidirectional dimension of the network. Each router
contains two separate virtual networks with different priorities
that share the same physical channels. The priority 1
network can preempt the wires even if the priority 0 network
is congested or jammed.

Each of the 18 router paths contains buffers, comparators,
and output arbitration (Figure 14b). On each data path, a
comparator compares the head flit, which contains the
destination’s address in this dimension, to the node coordinate.
If the head flit does not match, the message continues
in the current direction. Otherwise the message is routed to
the next dimension. Messages entering the dimension compete
with messages continuing in the dimension at a two-to-one
switch. Once a message is granted this switch, any other
input is locked out for the duration of the message. Once the
head flit of the message has set up the route, subsequent flits
follow directly behind it.

Figure 14. Block diagram of the routers. The two priorities
per dimension are completely separate except where they
share physical channels (a). Each priority contains forward,
reverse, and previous to next dimension datapaths (b).

Address arithmetic unit. The AAU, the largest logic block
in the MDP, performs all functions associated with memory
addressing. To support naming and relocation, the AAU contains
the address and ID registers. It protects memory accesses
and implements the translation instructions. Each
memory reference is offset by the selected address register’s
base field and checked against its length field. An attempt to
access through an invalid address register (which may occur
when an object relocates) or to access beyond the end of an
object raises an exception. A translation base/mask register
defines an area of memory to be a two-way, set-associative
translation buffer used by the XLATE, PROBE, and ENTER
instructions. The AAU hashes the keys used to access this
table using an exclusive-OR network to improve hit rate in
the translation buffer.

The AAU maintains two queues to buffer incoming messages
and schedule the associated tasks. Associated with each
queue are a queue base/mask (QBM) and a queue head/length
(QHL) register. (See Figure 15.) The QBM registers
define the position and length in main memory of the message
queues. Queues are circular, so messages at the end of
the queue wrap around to the beginning. Each QHL register
points to the beginning of the first message in its queue, and
its length field encompasses exactly all of the messages currently
in the queue. When the MDP dispatches a task to
handle a message, it loads the A3 register with a segment
descriptor for the message. The processor dispatches a task
as soon as the first four words of a message are written. If the
task attempts to read a word of the message which has not
yet arrived, a special Early fault occurs.

Layout. Figure 16 shows a floor plan of the chip with a die
photograph for comparison. Table 2 breaks down the area usage.


Figure 15. The AAU maintains the queue base/mask (QBM) 
registers, which specify the location of the message queues 
in main memory, and the queue head/length (QHL) regis¬ 
ters, which specify the beginning and end of the messages 
received in each queue. Figure shows only one queue. 
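The circular queue bookkeeping can be sketched in Python. This is an illustration under stated assumptions: the `MessageQueue` class is hypothetical, the queue size is assumed to be a power of two so that a mask can perform the wraparound, and the head is kept as an offset from the base; the text does not specify the exact register encoding.

```python
class MessageQueue:
    def __init__(self, memory, base, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.memory = memory
        self.base, self.mask = base, size - 1   # QBM-like base and mask
        self.head, self.length = 0, 0           # QHL-like head and length

    def _address(self, index):
        """Address of the index-th queued word, wrapping at the queue end."""
        return self.base + ((self.head + index) & self.mask)

    def enqueue(self, words):
        if self.length + len(words) > self.mask + 1:
            raise OverflowError("message queue full")
        for word in words:
            self.memory[self._address(self.length)] = word
            self.length += 1

    def dequeue(self, nwords):
        if nwords > self.length:
            raise IndexError("not enough words queued")
        words = [self.memory[self._address(i)] for i in range(nwords)]
        self.head = (self.head + nwords) & self.mask
        self.length -= nwords
        return words


memory = [0] * 16
queue = MessageQueue(memory, base=8, size=4)
queue.enqueue([1, 2, 3])
first = queue.dequeue(2)
queue.enqueue([4, 5])          # the second word wraps to the queue base
rest = queue.dequeue(3)
print(first, rest, memory[8:12])
```

The word 5 lands at the base of the queue region, showing the wraparound that lets a message straddle the end of the queue.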

[Floor plan blocks: two 512 x 144-bit SRAMs (2,048 words each),
each with an internal memory interface; X, Y, and Z routers;
address arithmetic unit (datapath); prefetch; net input; external
memory interface; diagnostic; net output; clock; registers;
arithmetic/logic unit (datapath).]

Figure 16. MDP chip floor plan (a) and die photograph (b).

Methodology. We implemented the MDP using Intel stan¬ 
dard cells except for the on-chip RAM, clock generator, and 
pads. Using standard cells sacrificed a factor of three to four 
in area and two to three in performance over what would be 
possible with full-custom design. The advantage was a sig¬ 
nificant increase in productivity which was essential to com¬ 
pleting the chip successfully with our small design team. 

The 700 or so sheets of schematics drafted at MIT used
35,000 standard cells containing 210,000 transistors. (The
remaining 890,000 devices are contained in the full custom
portions of the chip, mostly in the RAM.) We sent these
schematics to Intel for layout. Designers laid out many of the
data paths by hand to exploit the regularity of the design.
Automatic place and route CAD tools laid out the less regular
collections of logic.

We began architecture studies leading to the MDP in Octo¬ 
ber 1986. Work on the RTL model of the microarchitecture 
began in June 1988, and schematic entry at MIT started that 
November. The task of translating schematics into layout com¬ 
menced in June 1989, and we finished the layout in Decem¬ 
ber 1990. We received first silicon in June 1991 and were 
running programs on it within a few hours. 



Table 2. Chip area breakdown.

  Module                      Dimensions    Area     Transistors
                              (mm)          (mm²)    (x10³)
  AAU                         3.7 x 7.0     25.9     75.0
  RALU                        3.7 x 2.9     10.7     39.0
  Diagnostic                  0.9 x 1.1     1.0      3.7
  Prefetch                    0.9 x 1.1     1.0      3.2
  Control                     1.1 x 2.6     2.9      8.7
  Internal memory interface   7.8 x 0.5     3.9      13.0
  External memory interface   1.6 x 1.8     2.9      9.0
  Net input                   1.8 x 0.7     1.3      4.4
  Net output                  2.1 x 1.8     3.8      18.0
  Routers                     8.4 x 1.3     10.9     29.0
  RAM                         8.8 x 4.9     43.1     880.0
  Clock                       0.7 x 0.8     0.6      0.1
  Pads                        50.5 x 0.2    8.4      2.6
  Full chip                   10.2 x 15.0   153.0*   1,087.0

  * Includes wiring between modules

Figure 17. Photograph of 64-node J-Machine system.



Figure 18. A 1,024-node J-Machine chassis. 


Although we thoroughly simulated the logic design, we 
have uncovered 12 bugs while running our validation tests 
and applications on the hardware. Some of these bugs have 
simple software work-arounds, but for performance reasons 
we sent a second revision of the layout with modified control 
logic and some metal fixes for fabrication in January of this 
year. We plan to use several thousand of these chips to build 
research multicomputers at MIT. 

System design. Figure 17 shows a photograph of a 64-node
J-Machine processor board measuring 20.5 in. x 24 in.
Each node consists of an MDP chip (in a 168-pin grid array
package) and three 4-Mbit DRAMs. Each pair of nodes shares
a set of elastomeric connectors to communicate with the
corresponding nodes on the boards above or below the board
in a stack. A total of 32 elastomeric connectors held in four
connector holders provide 2,240 electrical connections between
adjacent boards. Of these connections, 960 are used
for signalling and the remainder are ground returns. No power
is supplied through the elastomers. Bus bars supply power
and ground directly to each board. The center area of the
board contains the final stage of the clock distribution network,
along with diagnostic fan-out, multiplexing logic, and
temperature and airflow monitors.

Figure 18 shows a photograph of our chassis for a 1,024-node
system. The chassis contains a stack of 16 processor
boards, power supplies, and distribution bus bars. Twenty
tie rods bind the boards and compress the elastomer connectors.
A 4,096-node system can be built by combining four
chassis. Each stack connects to its neighboring stacks by 128
(16 x 8) short, 60-pin ribbon cables, one for each pair of
nodes on the periphery. Each vertical pair of stacks shares a
3,000 cu ft/min. blower for cooling.

In addition to the processor board and chassis, we have 
also designed a diagnostic interface board and are designing 
a SCSI disk interface, a distributed graphics frame buffer, and 
an S-bus interface. Noakes and Dally 22 offer more details of 
the J-Machine system design. 

Software 

We intended the J-Machine as a platform for software ex¬ 
periments in fine-grain, parallel programming. To this end, 
we have implemented and are studying software systems for 
different fine-grain programming models. Fine-grain programs 
typically execute from 10 to 100 instructions between com¬ 
munication and synchronization actions. Reducing the 
grain size of a program increases both the potential speedup 
due to parallel execution and the potential overhead associ¬ 
ated with parallelism. Special hardware mechanisms to re¬ 
duce the overhead due to communication, process switching, 
synchronization, and multithreading are therefore central to 
the design of the MDP. Software issues such as load balanc¬ 
ing, scheduling, and locality remain open questions and are 
the focus of current research efforts. 
(defmethod Size-Of-Tree Pair ()
  (+ (Size-Of-Tree Left)
     (Size-Of-Tree Right)))

(defmethod Size-Of-Tree Object ()
  1)

(defmethod Size-Of-Tree Null ()
  0)


Figure 19. Concurrent Smalltalk source to compute Size- 
Of-Tree. Method definitions specify the class to which 
they apply. The class Pair contains two elements, Left and 
Right, each of which may hold an Object or another Pair. 


A parallel processor creates programming challenges. It is 
difficult to extract the fine-grain parallelism needed from stock 
programs written in C or Fortran. Instead of concentrating on 
extracting parallelism from existing programs (an active and 
interesting area for many parallel programming researchers) 
or on adapting sequential languages for the parallel domain, 
we focus on languages where the expression of fine-grain 
parallelism is much cleaner. To date, we have implemented 
two languages on the J-Machine: the actor language Concur¬ 
rent Smalltalk and the dataflow language Id. 

Concurrent Smalltalk. CST 23 is a parallel, object-oriented, 
programming language (based on the Actor model 5 ) with 
asynchronous message send and distributed objects. Its syn¬ 
tax is similar to that of Lisp or Scheme. It performs method or 
function invocation by sending a message to the first argu¬ 
ment of the method. The message contains the method se¬ 
lector and the rest of the arguments. 

Functions and methods in the language are compiled into 
MDP assembly code by an optimizing compiler, called Opti¬ 
mist, and assisted at runtime by a small kernel called Cosmos. 

MODULE OBJ:Selector.Size_Of_Tree 


Cosmos provides a global virtual name space, object-based 
memory management, support for distributed objects, and low- 
overhead context switching. Its memory management system 
provides fast, transparent access to storage distributed across 
the machine. Cosmos efficiently supports fine-grain concur¬ 
rent computation in which tasks are very short (40 user in¬ 
structions) and data objects are very small (eight words). The 
CST compiler and the Cosmos runtime system also provide 
floating-point arithmetic, simple arrays, and garbage collection 
for CST programs. Cosmos manages contexts, futures, and ob¬ 
jects, and therefore plays an important role in providing ser¬ 
vices that exploit the communication, synchronization, and 
naming mechanisms of the J-Machine. 

Figure 19 shows a small sample program defining the
Size-Of-Tree method for three object types: a Pair, the Null
object, and a generic object. When called on a Lisp-style tree,
these methods return the number of generic objects stored in
the tree. For example, when called on the tree
'((1 2 3) (4 5 6) (7 8 9)), Size-Of-Tree returns the value 9.
Note that since Pair and Null are subclasses of Object, their
more specific methods are selected when Size-Of-Tree is
invoked on their types.

When Optimist, the CST compiler, compiles this example 
program, it defines a selector object and three function ob¬ 
jects. The selector object (shown in Figure 20) lists the type 
and function correspondence. When a method applies to an 
object, Cosmos examines the object type and locates the ap¬ 
propriate function in the selector object. The MDP then in¬ 
vokes this function on the object. (In cases where the compiler 
can infer the type of the object or when the type of objects is 
explicitly declared, the compiler optimizes a method invoca¬ 
tion directly to the correct function invocation.) The com¬ 
piler marks the selector object as copyable, and Cosmos 
maintains it like any other object. 
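
As a rough illustration of the selector lookup described above, the following Python sketch models a selector object as a table from class to function. It is sequential, so it shows only the dispatch and not the concurrency of CST. The names `Obj`, `apply_selector`, and `from_list` are hypothetical and are not part of CST or Cosmos.

```python
class Obj:
    """Generic object; stands in for the CST class Object."""


class Null(Obj):
    """The empty list."""


class Pair(Obj):
    def __init__(self, left, right):
        self.left = left
        self.right = right


def apply_selector(selector, obj):
    """Pick the most specific function listed in the selector for obj's class."""
    for cls in type(obj).__mro__:
        if cls in selector:
            return selector[cls](obj)
    raise TypeError("no function for this class in the selector")


def size_object(obj):
    return 1


def size_null(obj):
    return 0


def size_pair(pair):
    return (apply_selector(SIZE_OF_TREE, pair.left)
            + apply_selector(SIZE_OF_TREE, pair.right))


# Plays the role of the selector object: one entry per class.
SIZE_OF_TREE = {Obj: size_object, Null: size_null, Pair: size_pair}


def from_list(items):
    """Build a Lisp-style chain of Pairs ending in Null from nested lists."""
    node = Null()
    for item in reversed(items):
        head = from_list(item) if isinstance(item, list) else Obj()
        node = Pair(head, node)
    return node


tree = from_list([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(apply_selector(SIZE_OF_TREE, tree))
```

Walking the class hierarchy from most to least specific is what lets the Pair and Null entries take precedence over the entry for Object in this sketch.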

Figure 21 shows the compiled code for the function for the 
class Pair. When a method applies to a particular object, 
Cosmos examines the object class and the selector object, 
and chooses the correct function to invoke. 

The function first does an XLATE opera¬ 
tion to get the address of the Pair and uses 
that address to get the object ID for Left. It 
then calls Cosmos to find the node where 
Left exists. The function sends a message to 
Left that recursively applies the Size-Of-Tree 
method. It marks the slot that will hold the 
return value with a Cfut tag. Next, it applies 
Size-Of-Tree to Right without waiting for the 
result of the first remote procedure to return. 
However, when the function attempts to add 
the two return values, the results will prob¬ 
ably not have returned yet. In this case, the 
ADD instruction will fault trying to add Cfuts, 
and the MDP will suspend the process, sav¬ 
ing its registers into the context. 


DC   Copyable | class_Selector   ; Identify properties of Size_Of_Tree selector
DC   OBJ:Selector.Size_Of_Tree   ; Store own ID inside selector
DC   3                           ; Number of functions
DC   CLASS:Object                ; Class identifier for Object
DC   {function.Size_Of_Tree}     ; Function for class Object
DC   CLASS:Null                  ; Class identifier for Null
DC   {function.Size_Of_Tree_1}   ; Function for class Null
DC   CLASS:Pair                  ; Class identifier for Pair
DC   {function.Size_Of_Tree_2}   ; Function for class Pair

Figure 20. Selector object generated by the example program. 
MODULE OBJ:function.Size_Of_Tree_2
        DC      Copyable | class_Function       ; Identify properties of Size_Of_Tree function
        DC      {OBJ:function.Size_Of_Tree_2}   ; Store own ID inside function

;; Incoming: A1 points to the context
;;           A3 points to the message
START:
        MOVE    [2,A3],R3                       ; Get the Pair's object ID
        XLATE   R3,A2                           ; Find the Pair's local address
        MOVE    [2,A2],R0                       ; Get the object ID of Left
        CALL    objectNode.R1                   ; Find Left's node -> R1
        DC      MSG:Apply_Selector              ; Send a message to Left to apply
        SEND2   R1,R0                           ; the method specified by the
        DC      {OBJ:Selector.Size_Of_Tree}     ; selector for Size_Of_Tree.
        SEND    R0                              ; Includes our context ID
        SEND    [2,A2]                          ; and a continuation.
        MOVE    5,R0
        SEND2E  [1,A1],R0
        WTAG    R0,CFUT,R0                      ; Make a future for Left's
        MOVE    R0,[5,A1]                       ; result.
        MOVE    [3,A2],R0                       ; Do the same for Right as for
        CALL    objectNode.R1                   ; Left.
        DC      MSG:Apply_Selector
        SEND2   R1,R0
        DC      {OBJ:Selector.Size_Of_Tree}
        SEND    R0
        SEND    [3,A2]
        MOVE    6,R0
        SEND2E  [1,A1],R0
        WTAG    R0,CFUT,R0
        MOVE    R0,[6,A1]
        MOVE    [6,A1],R2                       ; Do the sum.
        ADD     R2,[5,A1],R1
        MOVE    [3,A3],R3                       ; Get the continuation for
        BNIL    R3,^L001                        ; this context. Reply if
        DC      MSG:Reply                       ; non-nil.
        SEND2   R3,R0
        SEND    R3
        SEND2E  [4,A3],R1
L001:
        SUSPEND
        END


Figure 21. Compiled code for the Size-Of-Tree function for objects of class Pair. 


Assuming this happens, when the MDP receives a reply from one
of the methods, and after the value is written into the future
slot, Cosmos checks whether the process was waiting for that
particular future. If so, it reactivates the context. The
reactivated function then sums the two results and forwards the
sum to the continuation specified in the original method
invocation.
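
The suspend-and-resume behavior just described can be sketched in Python as follows. This is a simplified model rather than the Cosmos implementation: the class `Context`, its methods, and the `CFUT` sentinel are hypothetical stand-ins, and slots 5 and 6 are chosen only to mirror the context offsets used in Figure 21.

```python
CFUT = object()  # stand-in for the Cfut tag marking a slot whose value is pending


class Context:
    def __init__(self, continuation):
        self.slots = {5: CFUT, 6: CFUT}  # results of Left and Right
        self.waiting_on = None           # slot the suspended process waits for
        self.continuation = continuation

    def try_add(self):
        """Attempt the sum; 'fault' and suspend if an operand is still a future."""
        for slot in (6, 5):
            if self.slots[slot] is CFUT:
                self.waiting_on = slot   # suspend until this slot is filled
                return False
        self.waiting_on = None
        self.continuation(self.slots[6] + self.slots[5])
        return True

    def reply(self, slot, value):
        """A reply arrives: fill the slot, then resume if the process waited on it."""
        self.slots[slot] = value
        if self.waiting_on == slot:
            self.try_add()


results = []
ctx = Context(results.append)
ctx.try_add()      # neither reply has arrived, so the process suspends
ctx.reply(5, 3)    # Left's result arrives; the process still waits on slot 6
ctx.reply(6, 4)    # Right's result arrives; the process resumes and sums
print(results)
```

In this model a reply to a slot that nobody is waiting on simply fills the slot, which corresponds to the case where results return before the function tries to add them.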

Let us consider some interesting points: 

• If the object of the function is not present or if the
translation cache does not have an entry for the object, the
XLATE instruction will fault. Cosmos will find the object
and move or copy it to the local node.

• Cosmos maintains functions and selectors like any other 
immutable object. If they are not present, Cosmos will 
copy them to the node, a process analogous to a distrib¬ 
uted instruction cache. 

• If the function were preempted and the object moved or 
migrated away, Cosmos would invalidate the address 
registers. Accesses to the object would cause a fault that 
would attempt to retranslate or reobtain the object. 

• The A1 register points to the current context. The con¬ 
text contains storage to hold working variables or, if the 
context faults, to hold spilled register values. In the ex¬ 
ample, the futures are constructed in the context, and 
thus are named context-future (Cfut). 

This example illustrates some important research questions 
related to the efficiency of this model of computation. 

• When is it better to spawn processes nonlocally rather 
than locally? This is probably a strong function of the 
amount of associated overhead. The MDP architecture 
attempts to reduce this overhead, but algorithms for 
making this trade-off at compile and runtime still need 
to be developed and evaluated. 

• How should we place objects in the machine, and how 
should they migrate in order to reduce the overhead of 
communication? 

• In some cases, the amount of parallelism grows much 
larger than the machine can handle. We need to study 
how we can effectively and automatically throttle the 
parallelism created by the machine when it becomes 
saturated. 

Horwat discusses these issues, and others related to the
efficiency of programming fine-grain, parallel processors, in
more detail. 23

Dataflow implementation. Id is a functional program¬ 
ming language originally designed for dataflow architectures. 24 
The Id compiler converts an Id program into a dataflow graph, 
in which nodes represent operators and arcs represent de¬ 
pendencies. Originally, researchers executed these dataflow 
graphs directly on specialized dataflow machines. More re¬ 
cently, they have begun compiling dataflow graphs to run on 
general-purpose parallel machines. 25 Dataflow programs suit 
large parallel computers, because the abundance of fine-grain 
tasks—each of which can be as small as a single dataflow 
operator—makes it easy to mask communication latency with 
task switches. Conversely, the J-Machine’s fine-grain mecha¬ 
nisms make it an excellent target for dataflow programs. 

We experimented with several methods of executing
dataflow programs on the J-Machine. 26 The simplest of the
systems translates each node of the dataflow graph into a
sequence of MDP instructions. A dataflow node with two
inputs takes 20 MDP instructions to simulate. To do so, it
stores the first data value, matches it with the second value
when it arrives, performs the dataflow operation, and sends
the resulting value to two destinations. This process uses the
Cfut tag and fault handler.
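
The store, match, fire, and send steps of a two-input dataflow node can be sketched as below. This Python model is an assumption-laden illustration, not the MDP instruction sequence: `TwoInputNode` and `Sink` are hypothetical names, and the matching is done with an ordinary dictionary rather than with the Cfut tag and fault handler.

```python
import operator


class Sink:
    """Records every value delivered to it, for inspection."""

    def __init__(self):
        self.values = []

    def receive(self, port, value):
        self.values.append((port, value))


class TwoInputNode:
    def __init__(self, op, destinations):
        self.op = op
        self.destinations = destinations  # list of (node, port) pairs
        self.pending = {}                 # port -> value for the first arrival

    def receive(self, port, value):
        self.pending[port] = value
        if len(self.pending) < 2:
            return                        # first operand: store it and wait
        left, right = self.pending[0], self.pending[1]
        self.pending = {}                 # ready for the next pair of operands
        result = self.op(left, right)     # perform the dataflow operation
        for node, dest_port in self.destinations:
            node.receive(dest_port, result)


sink_a, sink_b = Sink(), Sink()
node = TwoInputNode(operator.sub, [(sink_a, 0), (sink_b, 1)])
node.receive(1, 4)    # right operand arrives first and is stored
node.receive(0, 10)   # left operand arrives; the node fires
print(sink_a.values, sink_b.values)
```

Operands may arrive in either order; the port number, not the arrival order, decides which one is the left input.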

A more efficient approach increases the granularity of each 
task to reduce scheduling overhead. We are building a sys¬ 
tem on top of the Berkeley TAM project 25 that addresses the 
inefficiencies of our earlier systems. 


We built the MDP to demonstrate the utility of general-purpose
communication, synchronization, and naming mechanisms in a
multicomputer building block. Its
mechanisms efficiently support dataflow 26 and object-oriented 
programming 23 models using a global name space. The use 
of a few simple mechanisms provides orders of magnitude 
lower communication and synchronization overhead than is 
possible with multicomputers built from off-the-shelf micro¬ 
processors. Its communication and synchronization perfor¬ 
mance competes with processing nodes specialized to a single 
model of computation, such as iWarp 15 (systolic) or the trans¬ 
puter 13 (communicating sequential processes). 

Computers built from fine-grain processing nodes, such as 
the MDP, consisting of a small but powerful processor and a 
small memory, are more cost-effective than those built from 
fewer coarse-grain nodes. Fine-grain nodes devote a larger 
fraction of their silicon area to processing and have higher 
arithmetic, memory, and communication bandwidth per unit 
cost. Large-scale parallel machines built from fine-grain pro¬ 
cessors have a larger total amount of memory within a given 
latency of a processor. An efficient network design provides 
global memory latency and bandwidth competitive with 
coarse-grain machines. 

The MDP is a component for building scalable computer 
systems. It is useful in configurations ranging from one node 
to 65,536 nodes. A 128-node Jellybean Machine is currently 
operational and resources are in place to build several more 
machines, including a 1,024-node system at MIT and ma¬ 
chines at a number of other research institutions. 

The MDP project demonstrated the feasibility of building
experimental computer systems with limited resources. By
concentrating on the novel mechanisms of the MDP and keeping
the design simple and modest in other respects, we completed
the design of the chip, its system-level hardware, and
several programming systems with a handful (fewer than eight)
of graduate students and engineers in two and a half years.

With the MDP we have begun exploring mechanisms for 
parallel computers. Much work remains to be done to tune 
the MDP’s mechanisms and compare them to alternatives. 
The demands of parallel software that drive these mecha¬ 
nisms are very different from the demands placed on se¬ 
quential computers. We find the design of mechanisms for 
parallel computers particularly challenging because no well- 
established parallel benchmarks exist. Additionally, most par¬ 
allel programs are very biased by the mechanisms (or lack 
thereof) of the machines for which they were initially written. 

Our software studies have suggested improvements that
could be made to the MDP. More registers and better mapping
mechanisms would be useful. The MDP's conservative
implementation leaves opportunities for streamlining by
decreasing the cycle time and the number of clocks per
instruction. A commercial, custom VLSI product based on the
architectural mechanisms in the MDP is very plausible.

As technology scales, we can put many powerful processing
units on one chip. An interesting direction for further
research is the extension of the MDP mechanisms to control
intranode as well as internode concurrency. The MIT
M-Machine project, now in its early phase, takes this approach.
It employs a processor-coupling mechanism to allow local
processors to interact with single-cycle latency.
Organization of the Motorola 88110
Superscalar RISC Microprocessor


Motorola’s second-generation RISC microprocessor employs advanced techniques for exploit¬ 
ing instruction-level parallelism, including superscalar instruction issue, out-of-order instruc¬ 
tion completion, speculative execution, dynamic instruction rescheduling, and two parallel, 
high-bandwidth, on-chip caches. Designed to serve as the central processor in low-cost per¬ 
sonal computers and workstations, the 88110 supports demanding graphics and digital signal 
processing applications. 


Keith Diefendorff 

Michael Allen 

Motorola 


Motorola designers conceived of the
88000 RISC (reduced instruction-set
computer) architecture to
simplify construction of microprocessors
capable of exploiting high degrees of
instruction-level parallelism without sacrificing clock
speed. The designers held architectural complexity
to a minimum to eliminate pipeline bottlenecks
and remove limitations on concurrent instruction
execution.

The 88100/200 is the first implementation of 
the 88000 architecture. It is a three-chip set, re¬ 
quiring one CPU (88100) chip and two (or more) 
cache (88200) chips. The CPU’s simple scalar 
design uses multiple concurrent execution units 
with out-of-order instruction completion to ap¬ 
proach a throughput of one instruction per clock 
cycle. 

The second-generation, single-chip 88110 RISC
microprocessor employs superscalar instruction
issue and out-of-order instruction execution
techniques to achieve a throughput greater than one
instruction per clock cycle.

Overview 

In designing the 88110, we aimed at a general-purpose
microprocessor, suitable primarily for use
as the central processor in low-cost personal
computers and workstation systems. Thus, our
design objective was good performance at a given
cost, rather than ultimate performance at any cost.
We recognized that the personal computer envi¬ 
ronment is moving toward highly interactive soft¬ 
ware, user-oriented interfaces, voice and image 
processing, and advanced graphics and video, 
all of which would place extremely high demands 
on integer, floating-point, and graphics process¬ 
ing capabilities. At the same time, we realized 
the 88110 would have to meet these performance 
demands while operating with the inexpensive 
DRAM systems typically found in low-cost per¬ 
sonal computers. 

To achieve the performance goals set for the 
88110, we needed to obtain more parallelism than 
was achieved in earlier microprocessors. To this 
end, we decided to use a superscalar micro¬ 
architecture to exploit additional instruction-level 
parallelism. Superscalar machines are distin¬ 
guished by their ability to dispatch multiple in¬ 
structions each clock cycle from a conventional 
linear instruction stream. This approach has shown 
good speedup on general-purpose applications 
and was a good match to available CMOS tech¬ 
nology. (We believe Agerwala and Cocke coined 
the superscalar term. 1 ) 

We selected the superscalar approach over
other fine-grain parallelism approaches, such as the vector
machine and the VLIW (very long instruction word) approach,
because it appeared to be more effective for our intended
application. With a limited transistor budget, spending
transistors on vector hardware would have meant sacrificing
scalar performance and would have yielded a machine that
suffered from the Amdahl's Law phenomenon 2 on
general-purpose applications.
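
To make the Amdahl's Law argument concrete, the short Python sketch below evaluates the standard formula, speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time that benefits from an enhancement and s is the speedup of that fraction. The fractions and the factor of 10 used here are hypothetical illustrations, not measurements from the 88110 design.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Hypothetical workloads: 20 percent vectorizable versus 90 percent vectorizable,
# each with vector hardware assumed to be 10 times faster on the vector part.
general_purpose = amdahl_speedup(0.2, 10.0)
scientific = amdahl_speedup(0.9, 10.0)
print(round(general_purpose, 2), round(scientific, 2))
```

When only a small fraction of the work can use the vector hardware, the overall gain stays close to 1 no matter how fast that hardware is, which is the phenomenon the text refers to.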

The VLIW approach 3 would have introduced severe software
compatibility restrictions by exposing hardware parallelism
to the object code program and thereby limiting future
implementation flexibility. Also, the VLIW speedup, while
substantial on code with abundant parallelism (such as
scientific applications), is less significant on general-purpose
applications. This limited speedup is due in part to code
expansion, which results from the inefficient use of opcode
bits to control unused execution units and from the aggressive
loop unrolling required to schedule the available execution
unit parallelism effectively. 4

The superscalar approach appeared to be a better match 
to CMOS technology than a superpipelined approach. Super¬ 
scalar designs rely primarily on spatial parallelism—multiple 
operations running concurrently on separate hardware— 
achieved by duplicating hardware resources such as execu¬ 
tion units and register file ports. 

Superpipelined designs, on the other 
hand, emphasize temporal parallel¬ 
ism—overlapping multiple operations 
on a common piece of hardware— 
achieved through more deeply pipe¬ 
lined execution units with faster clock 
cycles. As a result, superscalar ma¬ 
chines generally require more tran¬ 
sistors, whereas superpipelined 
designs require faster transistors and 
more careful circuit design to mini¬ 
mize the effects of clock skew. Some 
literature indicates that superscalar 
and superpipelined machines of the 
same degree would perform roughly 
the same. 5 We felt that CMOS tech¬ 
nology generally favors replicating 
circuitry over increasing clock cycle 
rates, since CMOS circuit density his¬ 
torically has increased at a much faster 
rate than circuit speed. 

The 88110 microarchitecture, illustrated in Figure 1, employs
a symmetrical superscalar instruction dispatch unit, which
dispatches two instructions each clock cycle into an array of
10 concurrent execution units. The design implements fully
interlocked pipelines and a precise exception model, but it
allows out-of-order instruction completion, some out-of-order
instruction issue, and branch prediction with speculative
execution past branches.

We optimized each execution unit for low latency: The 
branch unit uses a branch target instruction cache to reduce 
branch latency. The integer and graphics units are one-cycle 
units; the floating-point adder and multiplier are three-cycle, 
fully pipelined, IEEE extended-precision units. The load/store 
unit provides fast access to the cache on a hit. But it is also 
highly buffered to increase tolerance to long memory latency 
on a miss and to allow dynamic reordering of loads and 
stores for runtime overlapping of tight loops. 

The on-chip caches are organized in a Harvard arrangement,
giving the processor simultaneous access to instructions
and data. Each 8-Kbyte, two-way, set-associative cache
provides 64 bits each clock cycle to its respective unit. The
write-back data cache is nonblocking for some types of
accesses, and it follows a four-state MESI (modified, exclusive,
shared, invalid) protocol 6 for coherence with other caches in
multiprocessor systems. It also supports selective cache
bypass and software prefetching from user mode. Two
independent, 40-entry, fully associative address translation
look-aside buffers (TLBs) support a demand-paged virtual
memory environment. A common, 64-bit, external bus services
cache misses. The demultiplexed, pipelined bus supports
burst mode, split transactions, and bus snooping.

Figure 1. Block diagram of the 88110.

We designed the 88110 especially for the three basic system
configurations shown in Figure 2:

• single processors tightly coupled to low-cost DRAMs;

• dual-processor systems, also coupled to inexpensive
DRAMs, either in a symmetrical multiprocessing arrangement
or with one of the 88110s dedicated to a particular
function such as graphics or digital signal processing
(DSP); and

• medium-scale shared-memory multiprocessor systems,
with each processor using local secondary static RAM
(SRAM) cache, which we call L2.

Figure 2. Target system configurations: single (a), dual (b),
and multiprocessors (c).

Instruction set architecture

To improve performance, we extended the instruction set
architecture of the 88110 beyond that of the 88100
microprocessor. We enhanced a number of the integer and
floating-point instruction sets and added a new set of
capabilities to support 3D, color graphics image rendering.
All the enhancements are upwardly compatible with the 88100;
that is, the 88110 can run existing 88100 binaries.

Base architecture extensions. We made the following
minor enhancements of the base instruction set:

• Extensions of integer multiply and divide to improve
support for signed multiplication and for arithmetic on
higher precision integers. Instructions permit multiplication
of two 32-bit numbers returning a full 64-bit result,
and division of a 64-bit number by a 32-bit number,
returning a 64-bit quotient.

• Additional information returned from the integer compare
instruction to improve string-handling capability.

• Addition of static branch prediction to the branch
opcodes, providing a mechanism by which the compiler
gives the hardware a hint as to the direction a given
conditional branch is likely to go. We estimate that the
compiler potentially can statically predict more than 85
percent of the dynamically executed conditional branches
correctly. We also believe that runtime branch profiling
can further improve this rate in specific cases.

• An option that allows 16-bit immediate address offsets
and literal constants to be treated as signed numbers
(the 88100 treats them only as unsigned).


Floating-point architecture extensions. Our enhance¬ 
ments of the floating-point architecture were more signifi¬ 
cant. Anticipating heavy use of the processor for graphics 
and DSP and greater use of floating-point data in many general- 
purpose PC applications, we added the following: 

• An extended floating-point register file to provide
register name space for floating-point variables and frequently
accessed constants beyond that provided in the 88100
architecture. The extended register file contains thirty-two
80-bit registers. Each register can hold one floating-point
number of any precision: single, double, or
double-extended. For compatibility with existing code,
single- and double-precision floating-point numbers
continue to be supported in the general register file as well.
The compiler can use the additional register name space
to improve code schedules and reduce memory references.
This feature alone results in a speedup of more
than 15 percent on the SPEC (Systems Performance
Evaluation Cooperative) floating-point benchmarks. 7 The
speedup is substantially greater on many graphics and
DSP-intensive routines. We expect further improvement
as compilers learn to take better advantage of this feature.

• Hardware support for IEEE-754, 80-bit, double-extended 
precision data, 8 to improve the accuracy and robustness 
of intermediate calculations in floating-point libraries. 

• Hardware support for arithmetic on infinities, to elimi¬ 
nate the need for trapping to an IEEE software envelope 
to handle infinities, a frequent occurrence in some graph¬ 
ics algorithms. 

• A time-critical floating-point mode to facilitate
implementation of real-time DSP algorithms. In this mode, the
hardware attempts to deliver an arithmetically sensible
result rather than trapping on exceptional conditions such
as underflow, overflow, and NaN (not a number). 8 For
example, the hardware flushes underflows to zero rather
than trapping to generate the exact IEEE-specified
denormalized result. This feature is useful in real-time
algorithms because it reduces execution time and eliminates
data-dependent time variations, thereby increasing
the amount of work that can be scheduled up to a
given deadline.

Graphics architecture extension. Fast processing of 3D 
graphics viewing transforms and lighting calculations requires 
high floating-point performance. However, good floating-point 
performance alone is not sufficient for good graphics perfor¬ 
mance. Due to the large amounts of data involved, graphics 
images are usually represented and rendered in packed, low- 
precision, fixed-point formats. Conventional microprocessors 
are not well suited to processing these data types. The tradi¬ 
tional solution of adding a special-purpose coprocessor to 
the system increases costs and creates the difficulty of 
seamlessly integrating another processor architecture into the 
software environment. 

A new set of instructions gives the 88110 this graphics 
capability. The hardware to implement these instructions takes 
only a small incremental investment in silicon (approximately 
2.5 percent), while substantially increasing performance on 
fixed-point shading and image processing. For many systems, 
these instructions eliminate the need for coprocessors or 
special-purpose external hardware. For systems demanding 
greater graphics performance, a dual-88110 system provides 
the coarse-grain parallelism of a coprocessor approach yet 
preserves a homogeneous programming and software 
environment. 

The new graphics instructions accelerate operations on the 
fixed-point and integer data types (Figure 3a,b) found in many 
3D, color image-rendering algorithms. They operate on these 
packed data types 64 bits at a time. 

Figure 3. Graphics data formats (each operand is 64 bits wide):
packed integer data formats in pixels (a) and packed
fixed-point data formats in color intensity values (b).


padd.t      rD,rS1,rS2    ; add fields
padds.x.t   rD,rS1,rS2    ; add fields with saturation
psub.t      rD,rS1,rS2    ; subtract fields
psubs.x.t   rD,rS1,rS2    ; sub fields with saturation
punpk.t     rD,rS1        ; pixel unpack
ppack.r.t   rD,rS1,rS2    ; pixel pack
pmul        rD,rS1,rS2    ; multiply
prot        rD,rS1,rS2    ; rotate
prot        rD,rS1,<05>   ; rotate immediate
pcmp        rD,rS1,rS2    ; Z compare

Figure 4. Graphics instruction set.

The graphics instruction set (Figure 4) provides addition 
and subtraction on 8-, 16-, and 32-bit fields within 64-bit op¬ 
erands using either modulo or saturation arithmetic. Satura¬ 
tion arithmetic allows overflows or underflows within a field 
to clamp at the maximum or minimum value representable 
in the field rather than wrapping around as in normal modulo 
arithmetic. This method can be useful, for example, when 
addition of a color intensity value could result in an overflow 
that, in modulo arithmetic, would alias to a lower intensity 
value and thus produce an undesirable visual anomaly. 
Saturation is available on signed, unsigned, and mixed-sign 
numbers. 
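
The difference between modulo and saturation arithmetic on packed fields can be sketched in Python as follows. This is only a model of the arithmetic described above for unsigned fields; the helper names are hypothetical, and the sketch makes no claim about the encoding or exact semantics of the 88110 padd and padds instructions.

```python
def split_fields(word, width):
    """Split a 64-bit word into unsigned fields, least significant field first."""
    mask = (1 << width) - 1
    return [(word >> shift) & mask for shift in range(0, 64, width)]


def join_fields(fields, width):
    """Reassemble fields (least significant first) into one 64-bit word."""
    word = 0
    for index, field in enumerate(fields):
        word |= field << (index * width)
    return word


def add_modulo(a, b, width=8):
    """Field-wise add in which an overflow wraps around within its field."""
    mask = (1 << width) - 1
    pairs = zip(split_fields(a, width), split_fields(b, width))
    return join_fields([(x + y) & mask for x, y in pairs], width)


def add_saturate_unsigned(a, b, width=8):
    """Field-wise add in which an overflow clamps at the field maximum."""
    top = (1 << width) - 1
    pairs = zip(split_fields(a, width), split_fields(b, width))
    return join_fields([min(x + y, top) for x, y in pairs], width)


bright = 0x01F0   # low field holds an intensity of 0xF0
boost = 0x0120    # adding 0x20 overflows the low 8-bit field
print(hex(add_modulo(bright, boost)), hex(add_saturate_unsigned(bright, boost)))
```

With modulo arithmetic the bright field wraps to 0x10, a much darker value, whereas saturation holds it at 0xFF; neither version lets a carry spill into the neighboring field.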

The set includes instructions for unpacking, truncating, 
packing, and rotating 4-, 8-, 16-, and 32-bit data fields to 
quickly convert between packed, fixed-point formats (inten¬ 
sity values) and packed, short, integer formats (pixels). 

A graphics multiply instruction supports image-processing 
and -compositing algorithms, and a 64-bit compare instruc¬ 
tion allows comparison of two pairs of 32-bit, fixed-point or 
floating-point Z-buffer values in one instruction. 

Figure 5 is an example of how these
primitive instructions can be chained together to implement
complex image-processing operations such as compositing.
A four-instruction sequence is illustrated. 1) The punpk
instruction unpacks a 32-bit, four-channel, true-color pixel into
four 16-bit, zero-padded, fixed-point numbers. 2) The pmul
instruction multiplies the result by an 8-bit integer, producing
four new 16-bit, fixed-point numbers. 3) The padd instruction
adds these results to four other 16-bit, fixed-point numbers.
4) The ppack instruction truncates the four 16-bit results of
the addition to 8 bits, packs them together, and accumulates
them with the pixel computed in the previous iteration of the
loop by shifting the old pixel to the left and inserting the
new pixel in its place. The program then can write this
two-pixel result to the image buffer in memory,
using a double-word store (st.d).
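
The four-step chain can be modeled in Python as shown below. This is a loose sketch under stated assumptions, not a description of the actual instruction semantics: it assumes each 8-bit channel is zero-padded into the low byte of its 16-bit field, that truncation keeps the high byte of each 16-bit result, and that fields are handled as Python lists. The function names merely echo the instruction mnemonics.

```python
def punpk(pixel):
    """Unpack a 32-bit, four-channel pixel into four zero-padded 16-bit fields."""
    return [(pixel >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def pmul(fields, factor):
    """Multiply each field by an 8-bit integer, keeping 16 bits per field."""
    return [(field * factor) & 0xFFFF for field in fields]


def padd(a, b):
    """Add two sets of 16-bit fields (modulo arithmetic in this sketch)."""
    return [(x + y) & 0xFFFF for x, y in zip(a, b)]


def ppack(fields, previous=0):
    """Truncate each field to 8 bits, pack, and shift in behind the old pixel."""
    new_pixel = 0
    for field in fields:
        new_pixel = (new_pixel << 8) | (field >> 8)  # assumed: keep the high byte
    return ((previous << 32) | new_pixel) & 0xFFFFFFFFFFFFFFFF


foreground = 0xFF804000            # hypothetical four-channel pixel
background = [0x4000] * 4          # hypothetical 16-bit fixed-point terms
blended = padd(pmul(punpk(foreground), 128), background)
two_pixels = ppack(blended, previous=0x11223344)
print(hex(two_pixels))
```

The final word holds the previously computed pixel in its upper half and the new pixel in its lower half, matching the two-pixel result that the text says is written with one double-word store.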

An important characteristic of the graphics instructions is 
their clean integration with the 88000 architecture, made pos¬ 
sible by the 88000’s special-function unit concept, which al¬ 
lows the instruction set architecture to be easily extended. 
Figure 5. Graphics instruction chaining.

All the new instructions comply with the RISC philosophy of
instruction set design. 9,10 They all take two 64-bit operands
from the general register file, perform a simple operation,
and produce a single 64-bit result. No instruction side-effects
or special-purpose "kludge registers" are introduced into the
programming model. Since data is kept in the general register
file, all the existing 88000 arithmetic, logical, and bit-field
instructions can be freely applied to the graphics data.
Furthermore, the storage of data in the general register file
allows the fixed-point graphics operations to overlap
floating-point graphics operations, creating a high-throughput
graphics pipeline.

Instruction fetch and issue 

The heart of the 88110 microarchitecture is a centralized 
instruction sequencer, which dispatches instructions into an 
array of parallel execution units (Figure 6). The sequencer 
fetches instructions from memory, tracks resource availability
and interinstruction dependencies, directs operand flow
between the register files and execution units, and dispatches 
instructions to the individual execution units. 

On each clock cycle, the sequencer fetches two instructions
from the instruction cache and two from the branch
target instruction cache. It decodes the appropriate instruction
pair while fetching the necessary data operands from
the register files. If all the required execution units and
operands are available, the sequencer simultaneously dispatches
both instructions to their respective execution units.

Instructions leave the sequencer in strict program order. 
The sequencer always tries to dispatch two instructions; if it 
can’t, it tries to dispatch at least the first of the pair. In that 
case, the second instruction moves into the first issue slot, a
new instruction is fetched to replace it, and the new instruction
pair tries to issue on the next clock cycle.
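
The in-order, two-wide dispatch rule can be illustrated with a small
simulation. This is a simplified sketch under stated assumptions: it
models only instruction class conflicts (not data hazards), it
assumes every unit can accept a new instruction each cycle, and the
UNITS table and the (name, unit) tuple encoding are hypothetical.

```python
from collections import deque

# Assumed unit counts for the sketch: two integer ALUs, one multiplier,
# one load/store unit.
UNITS = {"integer": 2, "multiply": 1, "load_store": 1}

def dispatch_trace(program):
    """Return, for each clock cycle, the names of the instructions dispatched.

    `program` is a list of (name, unit) tuples in program order.
    """
    pending = deque(program)
    trace = []
    while pending:
        free = dict(UNITS)           # all units accept work again each cycle
        issued = []
        for _ in range(2):           # examine the dispatch pair, in order
            if not pending:
                break
            name, unit = pending[0]
            if free[unit] == 0:
                break                # blocked: nothing behind it may dispatch
            free[unit] -= 1
            issued.append(name)
            pending.popleft()
        trace.append(issued)
    return trace
```

With two back-to-back multiplies at the head of the stream, the first
cycle dispatches only one instruction; the second multiply then moves
into the first slot and pairs with the instruction that follows it.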

Although the sequencer always dispatches instructions in
order, not all instructions issue, or begin execution, in order.
Reservation stations 11,12 in their respective execution units
allow branches and stores to be dispatched even if their source
operands are not available, so that further instruction
dispatch can continue. Branch and store instructions wait in the
reservation stations until the required source operands
become available and the instructions can issue. Thus, branches
and stores may issue out of order. This dynamic rescheduling 13
ensures that branches and stores, which normally constitute
about 30 to 40 percent of the dynamic instruction mix,
rarely stall on data dependencies and do not delay the
dispatch of subsequent instructions.

Once the sequencer has dispatched an instruction into an
execution unit pipeline, the instruction proceeds at a pace
set by the capability of that unit. When an execution unit
finishes an instruction, the sequencer controls write-back of
the results into the register file and forwarding of the results
to any execution unit that needs them immediately. The
sequencer ensures that no register conflicts exist, but it is
otherwise free to update the register file out of program sequence.
This out-of-order instruction completion model allows useful
work to proceed under long-latency operations. It also
allows a mixture of execution unit types, possibly with
variable-length pipelines, without resorting to a common, long,
fixed-length pipeline with complicated pipeline bypass
circuitry.

Figure 6. Instruction dispatch.

The master instruction pipeline, shown in Figure 7, is a 
conventional, four-stage RISC pipeline that completes most 
instructions in three clock cycles. In the first stage of the 
pipeline, the sequencer fetches an instruction pair from the 
instruction cache. In the second stage, it decodes these two 
instructions, fetches their operands from the register files, 
and decides whether or not to dispatch them into execution. 
Instructions execute during the third stage; for most
instructions the execute stage requires one clock cycle, but some
take more. In the fourth and final stage, the sequencer writes 
the results from the execution units into the register files. 

Three things can prevent an instruction from issuing: 1) A
necessary resource is not available or is busy (structural
hazard); 2) an operand conflict exists with a prior instruction
(data hazard); or 3) a branch causes a change in program
flow, requiring an alternate instruction stream to be fetched,
thus temporarily starving the dispatch unit of instructions
(control hazard). 4

Structural hazards. Structural hazards occur because of
pipeline resource or instruction class conflicts. Pipeline
conflicts are rare in the 88110 because the register files are
multiported with full-width data paths, and all execution units
(except the divider) either execute in one cycle or are fully
pipelined to accept a new instruction each clock cycle.

Instruction class conflicts occur when two instructions
requiring the same execution unit attempt to issue on the same
clock cycle; for example, two multiply instructions attempt to
issue as a pair, but only one multiplier execution unit exists.
The concurrency matrix in Figure 8 on the next page shows
the relatively few pairings of instructions in the 88110 that
will stall due to a class conflict. We eliminated a significant
number of class conflicts by providing a duplicate set of
integer ALUs (arithmetic logic units).

An important aspect of the 88110’s superscalar instruction
issue capability is that it is symmetrical; that is, any
instruction can be dispatched from either slot in an instruction
dispatch pair, as illustrated in Figure 9. For example, the
sequencer can dispatch a multiply instruction to the multiplier
regardless of whether the instruction is in the first or second
slot of a dispatch pair. Thus, the 88110 has none of the
artificial instruction ordering or pairing restrictions characteristic
of VLIW or restricted superscalar machines. Also, the sequencer
fetches instructions from the instruction cache two at a time
regardless of their address alignment, so no alignment
restrictions must be met. Removing these constraints frees the
compiler to optimize for more important considerations.

Figure 7. Master instruction pipeline; RF indicates register file.

Figure 8. Instruction concurrency matrix.

Data hazards. Instruction issue can also stall because of
data hazards such as read-after-write (true data dependency),
write-after-write (output dependency), or write-after-read
(antidependency) hazards. Read-after-write hazards occur
when an instruction needs a result from a previous instruction
that has not yet completed execution. Write-after-write
hazards occur when an instruction writes to a register after a
subsequent instruction has already written to the same
register, thus leaving the register with old data. Write-after-read
hazards occur if an instruction attempts to write a result to
a register before a previous instruction reads the old value.
The write-after-write and write-after-read hazards are really
false dependencies, since they involve no true data
dependency, only a register name conflict.
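
The three hazard definitions reduce to simple comparisons of
register names. The following sketch classifies the hazards between
an earlier and a later instruction; the dictionary encoding of an
instruction ('dest' and 'srcs') is a hypothetical convenience, not
anything defined by the 88110.

```python
def classify_hazards(earlier, later):
    """Return the set of data hazards `later` has with respect to `earlier`.

    Each instruction is a dict with a 'dest' register name (or None)
    and a set of 'srcs' register names.
    """
    hazards = set()
    if earlier["dest"] is not None and earlier["dest"] in later["srcs"]:
        hazards.add("RAW")   # later reads what earlier writes (true dependency)
    if later["dest"] is not None and later["dest"] == earlier["dest"]:
        hazards.add("WAW")   # both write the same register (output dependency)
    if later["dest"] is not None and later["dest"] in earlier["srcs"]:
        hazards.add("WAR")   # later overwrites what earlier reads (antidependency)
    return hazards
```

Only the RAW case reflects a value actually flowing from one
instruction to the other; the WAW and WAR cases would disappear if
the later instruction used a different destination register name.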

A register-busy scoreboard automatically interlocks the
88110 pipeline against incorrect data on hazards by tracking
source and destination operand availability. Each time the
sequencer dispatches an instruction, it also marks the
instruction’s destination register as busy until the instruction
completes execution. As the sequencer considers instructions
for dispatch, it checks the scoreboard to ensure that no
register conflicts exist with prior instructions still in
execution. (The term scoreboard, as applied to computers,
originally referred to the complex centralized queue and
reservation mechanism used in the CDC 6600 for tracking all
aspects of out-of-order execution. 14 Recently, however, the
term has become generic, referring to any control unit that
handles register reservations, 12 including much less
sophisticated units than the CDC 6600’s, such as that in the
88110.)

The sequencer avoids incorrect data on read-after-write
hazards by checking the scoreboard bits of the source operand
registers, and on write-after-write and write-after-read
hazards by checking the scoreboard bit of the destination
operand register. The sequencer can dispatch branches and
stores even if their source operand is busy. However, they are
held in a reservation station and do not begin execution until
the scoreboard bit for the needed register clears and the
operand can be read.

Figure 9. Symmetrical superscalar instruction issue.
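
A register-busy scoreboard of this kind amounts to one bit per
register plus a dispatch check. The sketch below is a minimal model
under stated assumptions: the class and method names are
hypothetical, and the flag that lets branches and stores dispatch
with busy sources stands in for their reservation stations.

```python
class Scoreboard:
    """One busy bit per register; register 0 is treated as never busy."""

    def __init__(self, num_regs=32):
        self.busy = [False] * num_regs

    def can_dispatch(self, dest, srcs, uses_reservation_station=False):
        # Destination check guards write-after-write and write-after-read.
        if dest is not None and self.busy[dest]:
            return False
        # Branches and stores may dispatch and wait for sources elsewhere.
        if uses_reservation_station:
            return True
        # Source check guards read-after-write.
        return not any(self.busy[s] for s in srcs)

    def dispatch(self, dest):
        if dest:                      # skips None and hardwired register 0
            self.busy[dest] = True

    def complete(self, dest):
        if dest:
            self.busy[dest] = False
```

An instruction that reads a busy register is held back, while a
store that reads the same register is allowed to dispatch and wait.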

Of the three types of data hazards, read-after-write hazards 
cause the most instruction issue stalls in the 88110. We keep 
these stalls short by providing 1) low-latency execution units 
and 2) a register file bypass from the execution unit result 
buses to the execution unit inputs. The bypass makes
instruction results available immediately to subsequent
instructions without having to wait an extra clock cycle for results to
be written into and read out of the register file. 

For the most part, the 88110 relies on static scheduling of 
the instruction stream to avoid stalling on hazards. In many 
cases, static scheduling is straightforward and the compiler 
can handle it effectively. However, statically scheduling code 
around some types of data hazards can be difficult or
inefficient; that is why the 88110 performs dynamic rescheduling
of branches and stores.

As an example of dynamic rescheduling of stores, consider
the common operation of fetching data from memory,
performing some computation on it, and then storing the
result back in memory. If the computation requires multiple
cycles (as a floating-point multiply does, for example), the
store of the result introduces a data hazard that would stall
instruction issue even though no further need exists for that
data in the program. The store reservation stations allow
stores to be set aside and instruction issue to continue while
the store data is being computed. Then, when the store data
becomes available, the sequencer immediately forwards it to
the appropriate reservation station and allows execution of
the store to begin.

Similar stalls could occur on conditional branch operand
data hazards, due either to long-latency operations (such as
load-branch sequences) or to dispatch pair dependencies (such
as compare-branch pairs). As with stores, the 88110 provides a
reservation station to avoid stalling on these branches. In the
case of branches, however, an additional problem exists. The
machine does not know where to continue execution until the
branch operand is available and the sequencer can evaluate
the condition. Therefore, the sequencer predicts the branch
direction, and instructions down the predicted path execute
conditionally, or speculatively, until the branch operand is
resolved. The static prediction of the branch direction is
based on the opcode of the branch instruction.

The branch reservation station provides a place to set aside
the branch instruction so that instruction issue can continue
while the branch condition is being resolved. Once the
operand becomes available and the condition is evaluated, the
machine determines whether or not instruction execution
actually went down the correct path. If it did, useful work
was accomplished and execution simply continues
uninterrupted. If the prediction was incorrect and execution
went the wrong way, the machine backs up to the branch,
undoing all changes made to the registers by conditionally
executed instructions, and resumes execution down the other
path.

Figure 10 contrasts the pipeline situation with and without
speculative execution on a taken conditional branch (bcnd)
that is dependent on a load. With speculative execution and
branch prediction (top), the new instruction stream (target 0,
target 1, and so on) begins execution immediately with no
bubbles introduced into the pipeline. (By bubbles we mean
lost opportunities for instructions to issue.) Without
speculative execution and branch prediction (bottom), the machine
would continue fetching down the sequential instruction
stream (next 0, next 1), since the target address would not
yet be available. Also, instruction dispatch down the target
path would have to delay for two clock cycles (four bubbles,
at two issue slots per cycle) while the machine waits for load
data with which to compute the branch direction.

Figure 10. Speculative execution.

During speculative execution, the instruction fetches that 
miss the instruction cache access the bus just as in normal 
execution. Load instructions can access the data cache on a 
hit, but the bus will not service a data cache miss until the 
branch condition resolves. Store instructions can be dispatched 
to the reservation stations but can never access the cache or 
the bus during conditional execution. This procedure prevents
corruption of the memory image by a store instruction
that is eventually canceled on a misprediction. 

The accuracy of branch prediction and the penalty for
mispredicting are important, of course, to the overall
performance gain realized from speculative execution. Although
prediction accuracy depends on the compiler being used and
the nature of the application, our simulations indicate that
good static-prediction accuracy is achievable. On the SPEC
benchmarks over 80 percent of all conditional branches take
the anticipated path, and over 70 percent of the branches
that need to be predicted are being predicted correctly.
Currently, our compiler predicts only the simple branch cases,
so we expect these results to improve as the compiler
becomes more aggressive. Also, we are currently seeing a
penalty of less than one-half percent for mispredicting branches,
although the penalty may increase on applications that allow
deeper speculative execution.

Control hazards. Generally speaking, when a pipelined
processor encounters a branch, it needs time to evaluate the
condition, compute the target branch address, fetch
instructions from the new target instruction stream, and refill the
pipeline. The 88110 deals with pipeline bubbles caused by
control hazards by means of the speculative execution model
just described and by the use of a branch target buffer to
shorten branch execution latency. 15

The speculative execution model permits out-of-order
instruction execution to extend beyond the domain of a basic
block. It also helps keep the instruction pipeline full and the 
execution units busy even in the face of small basic blocks, 
but it does not address branch latency. 

RISC designers traditionally compensated for branch
latency by using a branch-and-execute 9 or delayed-branch 16
strategy to give the processor something to do while the
branch executes. In a superscalar design, however, a single
architectural delay slot is insufficient to cover the two
instruction bubbles inserted into the instruction pipeline by each
clock cycle of branch latency. Short branch latency is
important, even with speculative execution, to minimize the
number of instructions subject to cancellation in the event of
misprediction.

Due to the critical importance of control hazards to
performance, we invested a significant amount of circuitry in the
88110 to reduce branch latency. During the instruction
decode phase of the pipeline, the sequencer fetches two
instructions at the branch target address from the branch target
instruction cache (TIC) and supplies them as the first two
instructions down the branch-taken path. The sequencer
evaluates (or predicts, if necessary) the branch condition early in
the pipeline to select either the target instruction pair from
the TIC or the next sequential instruction pair from the
instruction cache in time for the next instruction decode phase.
By this time, with the branch target address computed, the
instruction cache can supply further instructions along the
target instruction stream. Thus, on a hit, the TIC can fill the
two branch pipeline delay slots with useful instructions.

The TIC has 32 entries and is fully associative. Each entry
in the TIC holds the first two instructions from a recently
taken branch target path. The hardware automatically loads
cache entries on a miss, following a FIFO replacement policy.
The sequencer uses the logical address of the branch being
evaluated to index the TIC. Thus, the target instructions are
available immediately for the next instruction fetch phase of
the pipeline.
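
The behavior of a small, fully associative cache with FIFO
replacement can be sketched as follows. The class and method names
are hypothetical, and a Python dictionary stands in for the
associative lookup; the point of the sketch is that a hit does not
refresh an entry's age, which is what distinguishes FIFO from LRU.

```python
from collections import OrderedDict

class TargetInstructionCache:
    """Maps a branch address to the first two instructions at its target."""

    def __init__(self, entries=32):
        self.entries = entries
        self.lines = OrderedDict()    # insertion order doubles as FIFO order

    def lookup(self, branch_addr):
        """Return the cached instruction pair, or None on a miss."""
        return self.lines.get(branch_addr)   # a hit does not reorder entries

    def fill(self, branch_addr, target_pair):
        """Load an entry after a miss, evicting the oldest entry if full."""
        if branch_addr not in self.lines and len(self.lines) >= self.entries:
            self.lines.popitem(last=False)   # evict the first-inserted entry
        self.lines[branch_addr] = target_pair
```

In a two-entry instance, an entry that was just hit is still the
first to be evicted if it was the first to be loaded.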

TIC hit rates depend heavily on the application, but our 
simulations show an average TIC hit rate of around 85
percent on the SPEC benchmark suite.

Exceptions 

Program exceptions and interrupts have always been a
problem in pipelined machines, especially those that execute
instructions out of order and/or speculatively. During normal
program execution, machine state changes can occur out of
program sequence as long as the state currently relevant to
the program appears to be in the correct program order. But
in a parallel machine, the internal pipelines and buffers
temporarily hold much of the dynamic machine state. Thus, at
any point in time, the register files do not completely reflect
the true current machine state. So, when a program
exception occurs and the instantaneous state of all the registers
becomes manifest, the dynamically held state can be lost or
confused. The register file’s inconsistency with the actual
machine state makes correct recovery from the exception
difficult or impossible.

Even with its heavily pipelined, out-of-order execution
model, the 88110 implements fully precise exceptions. That
is, the processor always presents the architecturally correct
state to an exception-handling routine. It also gives an exact
indication of which instruction caused the fault and where to
resume execution. The 88110’s precise exception model
dramatically simplifies and speeds up exception- and
interrupt-handling software routines.

When an instruction generates an exception (for example,
a page fault or an arithmetic overflow), instruction execution
continues until all instructions that issued prior to the faulting
instruction complete. (This step ensures that all synchronous
exceptions occur in strict program order.) At this point,
execution stops, the internal pipelines are cleaned up, and the
machine backs up to the instruction that caused the
exception, leaving all registers in the precise architectural state that
existed before the faulting instruction issued. The 88110
accomplishes this by means of a history buffer, 17 which records
the relevant, user-visible machine state as instructions issue.
The processor uses information stored in the history buffer to
quickly restore the machine state back to the point of the
exception. This is the same mechanism the 88110 uses to
recover from mispredicted branches.
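
The idea of a history buffer can be sketched in a few lines. This is
a simplified model, not the 88110's actual structure: it assumes one
entry per issued instruction, recorded in program order, holding the
value the destination register had before the instruction issued.
The class and method names are hypothetical.

```python
class HistoryBuffer:
    """Saves old register values at issue so that state can be rolled back."""

    def __init__(self, registers):
        self.registers = registers    # list of values, modified in place
        self.entries = []             # (sequence number, dest, old value)

    def record_issue(self, seq, dest):
        # Save the value the destination held before this instruction issued.
        self.entries.append((seq, dest, self.registers[dest]))

    def retire_through(self, seq):
        # Entries for instructions that can no longer be undone are dropped.
        self.entries = [e for e in self.entries if e[0] > seq]

    def roll_back_to(self, seq):
        # Undo, newest first, every instruction numbered `seq` or later.
        while self.entries and self.entries[-1][0] >= seq:
            _, dest, old_value = self.entries.pop()
            self.registers[dest] = old_value
```

Rolling back to a faulting instruction (or a mispredicted branch)
restores every register written by that instruction and by anything
issued after it, while leaving the results of earlier instructions
in place.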

When the machine recognizes an asynchronous external
interrupt, it halts execution, aborts all unfinished instructions
(or waives write-back in the case of a memory transaction in
progress on the bus), and backs out the effects of any
instructions that completed out of order. This procedure
minimizes interrupt response latency, a critical parameter in many
real-time system applications.

Register files 

The 88110 has two sets of register files, the formats of
which are shown in Figure 11a,b. The general register file
primarily holds fixed-point values and address pointers; the 
extended register file holds floating-point data. 

The general register file has thirty-two 32-bit registers, which 
can be used by all instructions in the machine. These registers 
are accessible in pairs to supply 64-bit operands whenever 
necessary—for example, for graphics and double-precision 
floating-point values. Register zero is hardwired to the
integer constant zero (0).

The extended register file is a new addition to the original 
88000 architecture. It contains thirty-two 80-bit registers, which 
are used exclusively by the floating-point instructions. Each 
extended register can hold one floating-point number in
either single, double, or double-extended format. Register zero
in the extended register file is hardwired to the floating-point
constant positive zero (+0.0E00). We provided instructions
that quickly move data back and forth between the two
register files.

Both eight-ported register files can supply all the operand
bandwidth required to sustain the peak instruction issue rate
of two instructions each clock cycle, regardless of the
instruction mix or data precision. Data-forwarding paths around
the register files route a result returning from an execution
unit directly to the inputs of a waiting execution unit while
the result is also being written into the register file (see
Figure 12). This approach avoids stalling the instruction issue an
extra clock cycle while data is written into the register before
it can be read out on a source port.

Figure 11. Register files: general (a) and extended or floating-point (b).

Figure 12. Operand data paths.

Execution units 

The 88110 contains 10 independent execution units: branch,
integer (two), bit field, multiplier, floating-point adder,
divider, graphics (two), and data or load/store. The dataflow
paths from outside the chip, through the caches and register
files, and into and out of the execution units, are shown
schematically in Figure 13.

Integer units. The integer units are simple 32-bit ALUs
that handle all the fixed-point arithmetic instructions and
logical instructions. The execution latency of both integer units
is one clock cycle.

Bit field unit. This unit is a shifter/masker
circuit that handles the 88000’s extensive set 
of bit-field manipulation instructions. It also 
has a single-clock-cycle execution latency. 

Multiplier unit. The multiplier unit handles all 32- and
64-bit signed and unsigned integer multiplies, the graphics
multiply, and the single-, double-, and extended-precision
floating-point multiplies. The fully pipelined unit can start a
new multiply instruction every clock cycle. The multiplier
has an execution latency of three clock cycles for all data
types. The 32 x 64-bit multiplier uses Booth partial product
generators and a Wallace tree to sum the partial products
twice each clock cycle to maximize circuit efficiency.

Floating-point adder unit. The floating-point adder
executes all single-, double-, and extended-precision
floating-point add, subtract, compare, and integer conversion
instructions. The fully pipelined unit can start a new
instruction on every clock cycle. The adder has a
three-clock-cycle execution latency for all precisions. A
special shortcut reduces the latency for floating-point
compare to one clock cycle. The dynamic 64-bit adder circuit
uses a combined block-carry-look-ahead and fast-carry-select
scheme. The actual 64-bit add time is much shorter than one
clock cycle. Most of the three-clock-cycle execution time
occurs because of the floating-point format operations such
as reserved-operand check, exponent debiasing, mantissa
alignment, normalization, and rounding.

The construction of a fully pipelined floating-point
multiplier and adder with short latencies is very hardware
intensive. The return on this investment is the ability to achieve a
more efficient static code schedule and an extremely high 
floating-point throughput. Long-latency operations require that 
a large number of independent instructions be available to 
be scheduled into the pipeline delay slots to avoid bubbles 
and keep execution unit usage high. In general, the compiler 
needs to find less program parallelism on hardware with short 
latencies than it does on hardware with long latencies to 
achieve the same level of performance. 

One commonly used technique for scheduling around
long-latency operations is loop unrolling. This technique increases
the basic block size and the number of data-independent
operations available to be scheduled into pipeline delay slots.
A difficulty with loop unrolling is that it requires a large
register name space so registers can be allocated to avoid data
hazards. It also increases the static code size, which can waste
memory space and instruction cache entries. The 88110’s short,
three-cycle floating-point latency is well balanced with the
large register files, making a small amount of loop unrolling
effective without requiring elaborate register-renaming
hardware. 11

Figure 13. Organization of the 88110.
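
Loop unrolling is easiest to see at the source level. The sketch
below contrasts a rolled dot product, in which every addition depends
on the previous one, with a version unrolled by four that uses four
separate accumulators. The function names and the unroll factor are
illustrative choices; Python itself gains nothing from this
transformation, which is shown only to make the dependency structure
visible.

```python
def dot_rolled(a, b):
    """Single accumulator: each add must wait for the one before it."""
    acc = 0.0
    for i in range(len(a)):
        acc += a[i] * b[i]
    return acc

def dot_unrolled4(a, b):
    """Four accumulators: the four multiply-adds per pass are independent."""
    acc0 = acc1 = acc2 = acc3 = 0.0
    main = len(a) - len(a) % 4
    for i in range(0, main, 4):
        acc0 += a[i] * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
    for i in range(main, len(a)):     # leftover elements
        acc0 += a[i] * b[i]
    return (acc0 + acc1) + (acc2 + acc3)
```

The unrolled body needs four accumulator names instead of one, which
is the register name space cost the text describes, and it is roughly
four times the static code size of the rolled loop body.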

Divider unit. The divider handles all 32- and 64-bit signed
and unsigned integer divides and all single-, double-, and
extended-precision floating-point divides. The iterative
divider uses a radix-8-per-clock SRT (Sweeney-Robertson-Tocher)
algorithm with a latency dependent on the operand
type and precision. The execution latency of single-precision
floating-point division equals 13 clock cycles.

Graphics units. Two execution units implement the new
graphics instructions. One handles the arithmetic operations,
and the other handles the bit-field packing and unpacking
instructions. Both units have a single-clock-cycle execution
latency. Because the two units are independent, each can
accept a new instruction every clock cycle. In fact, the
instructions are partitioned in a manner that often makes it
possible to schedule graphics algorithms to sustain a
throughput of a full two instructions each clock cycle. As an
example, Figure 14 on the next page shows execution of the
inner loop of a simple Gouraud shading algorithm. Since the
graphics units behave the same as other execution units,
graphics instructions can issue together with any other integer,
floating-point, or memory-referencing instruction. This
flexibility minimizes loop overhead and allows very efficiently
scheduled graphics routines.

Load/store unit. The load/store unit is the most
sophisticated execution unit in the 88110. We invested a considerable
amount of circuitry in this unit because of the critical
importance of memory referencing to overall performance. The unit
provides a stunt box 14 capability for holding memory
references that are waiting for the memory system and allows
dynamic reordering of loads past stalled stores. (The CDC 6600’s
stunt box did not allow loads to pass stores, but the term stunt
box has come to refer to any device that allows reordering of
memory references in the memory system. 12)

Figure 14. Graphics execution unit parallelism.

The load/store unit executes all instructions that transfer
data between the data cache, or bus interface, and the
register files. The data path from the load/store unit to the register
files is a full 80 bits wide. Load latency for 32- and 64-bit data
on a cache hit is two clock cycles, one longer than a normal
integer add instruction.

On each clock cycle, the unit can accept one new load or
store instruction from the instruction dispatch unit. When an
instruction is dispatched to the load/store unit, it awaits
access to the data cache in either the load queue or the store
queue 12 (see Figure 15). Normal instruction dispatch and
execution can continue while these instructions await service
by the cache or memory system. On properly scheduled code,
this buffering provides considerable tolerance of long memory
latency.

The load queue is a simple, four-deep FIFO queue. The
store queue is a somewhat more complex three-deep
reservation station that is also managed as a FIFO queue. Since
store instructions can be dispatched before the store data
operand is available, store instructions wait in the store queue
until the instruction computing the required data completes
execution. When the operand becomes available, the
sequencer directs it into the store reservation station, and the
associated store instruction becomes a candidate for access to
the data cache.

If a store instruction stalls in the reservation station waiting
for its operand, subsequently issued load instructions can
bypass the store and immediately access the cache. An
address comparator detects address hazards and prevents loads
from going ahead of stores to the same address, which would
return stale data. This load/store reordering feature allows runtime
overlapping of tight loops by permitting loads at the top of a
loop to proceed without having to wait for the completion of
stores from the bottom of the previous iteration of the loop.
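
The interplay of the store reservation station and the address
comparator can be sketched as a small model. Everything here is an
assumption for illustration: the class and method names, the use of
a dictionary as memory, the exact-address comparison, and the use of
None both as "data not yet available" and as "load must wait".

```python
class StoreQueue:
    """A small FIFO store reservation station with load bypass."""

    def __init__(self, depth=3):
        self.depth = depth
        self.entries = []             # oldest first: [address, data or None]

    def dispatch(self, address):
        """Accept a store whose data is not yet known; return its entry."""
        if len(self.entries) >= self.depth:
            raise RuntimeError("store queue full: dispatch must stall")
        entry = [address, None]
        self.entries.append(entry)
        return entry

    def forward_data(self, entry, value):
        """Deliver the computed store data to a waiting entry."""
        entry[1] = value

    def drain(self, memory):
        """Write stores in FIFO order until one is still missing its data."""
        while self.entries and self.entries[0][1] is not None:
            address, value = self.entries.pop(0)
            memory[address] = value

    def load(self, address, memory):
        """Return the loaded value, or None if the load must wait."""
        if any(addr == address for addr, _ in self.entries):
            return None               # address hazard: do not pass the store
        return memory[address]
```

A load to an unrelated address proceeds past the stalled store, while
a load to the same address is held until the store has been written.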

The data cache is nonblocking, or lock-up free, 18 for
store-load accesses. For example, when a load bypasses a
store and misses the cache, the cache can be decoupled from
the bus so that the store can access the cache while the bus
waits for memory. This is also true for a store miss followed
by a load hit and for the user-mode touch-load instruction.

Touch-load provides a limited form of decoupling of
load-store and load-load sequences. This instruction provides
a mechanism for a program to bring a cache line into the
cache before it is actually needed. While the load/store unit
waits for memory to deliver the data, instruction issue can
continue unrestricted. During that time, load and store
instructions can access the cache. The programmer can use the
touch instruction to prefetch data into the cache to avoid load
misses that would likely stall execution if serviced on demand.
When used properly, these instructions can significantly
increase cache hit rates and minimize load miss penalties.

The load/store unit implements other user-mode
instructions to allow more effective scheduling of the data cache.
An allocate instruction allocates a line in the cache without
first bringing the line from memory. A program can use this
instruction to avoid unnecessary bus transactions in cases
where it will overwrite the entire cache line anyway. In
addition, a line flush instruction can force a cache line out to
memory. The line flush provides a mechanism to update a
video frame buffer without allocating the frame buffer as
write-through storage and thereby sacrificing the burst-mode line
transfer capability of the bus.

The load/store unit also contains a selective cache bypass 19 
capability for stores. Store instructions can be selectively 
marked (with an opcode bit) to “store-through” the cache. 
Such a store bypasses the cache and proceeds directly to 
memory. If the reference hits the cache, the cache is updated; 
but if it misses the cache, no new line is allocated. This
feature prevents pollution of the cache with data known to be of
no further use to the program. It can increase cache hit rates 
by avoiding replacement of more useful entries in the cache. 
It can also improve cache miss latency by reducing the
number of “dirty” lines that have to be copied back to memory
when a new cache line is allocated. 

The load/store unit also executes the 88000’s XMEM in¬ 
struction, which performs an atomic read-write operation that 
exchanges the contents of a register with a memory location. 
A program can use this instruction to implement semaphores 
and shared-resource locks in multiprocessor systems. 20 It can 
be used as a primitive to construct a wide variety of complex 
synchronization protocols, such as spin-lock, compare-and-swap,
and fetch-and-add. For example,
one can construct an efficient spin-lock 
by repeatedly polling a lock with loads, 
which will hit in the cache and therefore 
not generate bus traffic. When the current 
owner of the lock releases it, the cache 
coherence logic brings the new copy of 
the lock into the cache, and the processor 
can then try to acquire the lock with an 
XMEM. The resulting indivisible read-write 
bus transaction ensures exclusive owner¬ 
ship of the lock. 
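
The polling scheme just described is often called test-and-test-and-set. The following is a minimal Python sketch of it, not 88000 code. The `Memory` class, its `xmem` method, and the `acquire`, `release`, and `run_demo` helpers are all hypothetical names; `xmem` only models the atomic exchange by guarding it with a `threading.Lock`, because Python has no such primitive.

```python
import threading
import time

class Memory:
    """One lock word; xmem() models an atomic register/memory exchange."""
    def __init__(self):
        self.word = 0                     # 0 = free, 1 = held
        self._bus = threading.Lock()      # stands in for the indivisible bus transaction

    def load(self):
        return self.word                  # ordinary load (would hit in the cache)

    def xmem(self, value):
        with self._bus:                   # read and write happen as one unit
            old, self.word = self.word, value
        return old

def acquire(mem):
    while True:
        while mem.load() != 0:            # poll with plain loads while the lock is held
            time.sleep(0)                 # yield so this Python demo stays responsive
        if mem.xmem(1) == 0:              # exchange; an old value of 0 means we won
            return

def release(mem):
    mem.xmem(0)                           # write 0 back to free the lock

def run_demo(threads=3, rounds=100):
    """Increment a shared counter under the lock from several threads."""
    mem, counter = Memory(), [0]

    def worker():
        for _ in range(rounds):
            acquire(mem)
            counter[0] += 1               # critical section
            release(mem)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter[0]
```

Only the exchange needs to be indivisible; the inner polling loop uses ordinary loads, which is what keeps the waiting processors off the bus in the scheme described above.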

The 88000 architecture uses primarily a 
“big-endian” byte order—that is, an ad¬ 
dress points to the most significant byte 
of a datum in memory—as opposed to a 
“little-endian” order—in which an address 
points to the least significant byte of a 
datum. The 88110 provides a solution for 
heterogeneous big/little-endian multipro¬ 
cessor systems with a mode switch that 
allows data memory references to be per¬ 
formed in either big- or little-endian fash¬ 
ion. In little-endian mode, the load/store 
unit swaps the bytes in all half-words, 
words, double-words, and quad-words as 
they transfer into or out of the cache. 
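
As an illustration of the byte swap, the sketch below reads the same bytes in either order and reverses the bytes of a value. It is a software model only; the function names `swap_bytes` and `load` are hypothetical and do not correspond to any 88110 instruction.

```python
def swap_bytes(value, size):
    """Reverse the byte order of an unsigned integer occupying `size` bytes."""
    return int.from_bytes(value.to_bytes(size, "big"), "little")

def load(memory, address, size, little_endian=False):
    """Read `size` bytes at `address` and interpret them in the selected byte order."""
    raw = bytes(memory[address:address + size])
    return int.from_bytes(raw, "little" if little_endian else "big")

# Example: the same four bytes seen in each mode.
memory = bytes([0x12, 0x34, 0x56, 0x78])
big = load(memory, 0, 4)                         # address points at the most significant byte
little = load(memory, 0, 4, little_endian=True)  # address points at the least significant byte
```

For any datum size, the little-endian reading equals `swap_bytes` applied to the big-endian reading.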

Address translation facilities 

The 88110 offers full hardware support 
for a demand-paged virtual memory sys¬ 
tem. 21 It provides hardware facilities for 
translating logical effective program ad¬ 
dresses to physical memory addresses, for 
protecting areas of memory from 
unprivileged accesses, and for trapping to 
supervisory routines on accesses to pages 
not currently in memory. The organiza¬ 
tion of the address translation facilities is 
diagrammed in Figure 16. 

The processor contains two indepen¬ 
dent and concurrent translation look-aside 
buffers—one for translating instruction ad¬ 
dresses, the other for data addresses. Each 
fully associative TLB can hold thirty-two 
4-Kbyte page address translation entries. 
Each entry contains a translation descrip¬ 
tor that maps a virtual page to its corre¬ 
sponding physical page number. 

On each memory reference, the hardware
looks up the logical address (which
is equal to the virtual address in the 88110)
by simultaneously comparing it to all entries
in the TLB, using content-addressable memory (CAM)
elements as illustrated in Figure 17. Each CAM element is
associated with one physical page descriptor, which is loaded
into the TLB from the memory-based page tables maintained
by the virtual memory operating system software.

Figure 15. Load/store execution unit.

Figure 16. Virtual address translation facilities.

Figure 17. Translation look-aside buffer (TLB).

If the TLB lookup finds an entry that matches the logical 
address of the memory reference being translated, it is a TLB 
hit and that entry is used to translate the address. If the memory
reference does not violate the access privileges specified by
the selected TLB entry, the physical page number from that entry
is combined with the page offset bits (the least significant 12
bits of the logical address) to form the final translated physical
address. If the reference does violate the access
privileges, the processor aborts the memory reference
and signals an attempted memory protection violation to the 
operating system by taking an access exception trap. 
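
The lookup can be sketched in software as follows. This is a simplified model under stated assumptions: the field names `supervisor_only` and `write_protected`, the exception classes, and the order in which the valid bit and the privileges are checked are all hypothetical, and the loop stands in for the CAM's simultaneous comparison.

```python
from dataclasses import dataclass

PAGE_BITS = 12                           # 4-Kbyte pages
OFFSET_MASK = (1 << PAGE_BITS) - 1

class TLBMiss(Exception):
    """No entry matches the logical page."""

class PageFault(Exception):
    """An entry matches but is marked invalid."""

class AccessException(Exception):
    """The reference violates the entry's access privileges."""

@dataclass
class Entry:
    logical_page: int
    physical_page: int
    valid: bool = True
    supervisor_only: bool = False        # assumed protection attribute
    write_protected: bool = False        # assumed protection attribute

def translate(tlb, logical_address, supervisor=False, write=False):
    """Return the physical address for a logical address, or raise."""
    page = logical_address >> PAGE_BITS
    for entry in tlb:                    # hardware compares all entries at once
        if entry.logical_page != page:
            continue
        if not entry.valid:
            raise PageFault(hex(logical_address))
        if entry.supervisor_only and not supervisor:
            raise AccessException(hex(logical_address))
        if entry.write_protected and write:
            raise AccessException(hex(logical_address))
        # Physical page number from the entry, offset bits from the logical address.
        return (entry.physical_page << PAGE_BITS) | (logical_address & OFFSET_MASK)
    raise TLBMiss(hex(logical_address))
```

For example, with an entry mapping logical page 0x12345 to physical page 0x00ABC, the logical address 0x12345678 translates to 0x00ABC678: the upper 20 bits are replaced and the low 12 bits pass through unchanged.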

Information stored in the TLB also governs certain caching 
policies, such as global access and cache bypass. The global 
property indicates that the referenced memory page is shared, 
and, therefore, other processors on the bus must “snoop” 
(watch for) any external bus transaction generated as a result 
of a cache miss to this page. The cache-inhibit and write- 
through properties allow data cache bypass to be controlled 
on an address basis at the granularity of a page. When a page
is marked “cache-inhibited,” all references (loads or stores) to
that page bypass the cache. If a page is marked
“write-through,” all stores to that page 
bypass the cache. This write-through ca¬ 
pability is similar to the selective cache 
bypass (store-through) feature described 
earlier, but it allows bypassing on an ad¬ 
dress (page) basis rather than an instruc¬ 
tion basis. 

If the memory reference matches an 
entry in the TLB (a hit) but the entry is 
marked “invalid,” the hardware generates 
a page fault trap, and control transfers to 
a supervisory routine. This routine would 
bring the accessed page in from disk, up¬ 
date the system page tables, and reissue 
the faulting memory instruction. 

If the memory reference does not 
match any entry in the TLB (a miss), one 
of two things can happen. The hard¬ 
ware can automatically walk through the 
operating system’s page tables to load a 
new descriptor into the TLB before start¬ 
ing the memory transaction. Or, the hard¬ 
ware can generate an exception, invoking 
a software routine to manually load a new 
descriptor into the TLB and then rerun 
the faulting memory instruction. The soft¬ 
ware TLB-reloading mechanism provides 
the flexibility to support virtual memory management sys¬ 
tems that use table structures (such as inverted page tables) 
different from those supported directly by hardware. 

The simple but efficient hardware table-walking algorithm 
illustrated in Figure 18 indexes through two levels of system 
segment and page tables to locate a page descriptor, which 
is then brought into the TLB. The algorithm includes an indi¬ 
rection capability, which treats the resolved page descriptor 
as a pointer that is followed one additional level to a final 
page descriptor. Indirection allows multiple virtual addresses 
that map (alias) to the same physical address to be mapped 
through a common page descriptor. This capability simpli¬ 
fies system maintenance of the page referenced and modi¬ 
fied status bits, typically used to implement an efficient 
demand-paged, virtual memory management system. 
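
A rough software model of the two-level walk with indirection is sketched below. The 10/10/12-bit split of the logical address, the dictionary-based tables, and the `indirect`/`target` keys are assumptions made for illustration; the text above does not give the actual table formats.

```python
SEGMENT_BITS, PAGE_INDEX_BITS, OFFSET_BITS = 10, 10, 12   # assumed address split

def walk(segment_table, logical_address):
    """Return the page descriptor for an address, or None if none is mapped."""
    segment = logical_address >> (PAGE_INDEX_BITS + OFFSET_BITS)
    index = (logical_address >> OFFSET_BITS) & ((1 << PAGE_INDEX_BITS) - 1)

    page_table = segment_table.get(segment)      # level 1: segment table
    if page_table is None:
        return None
    descriptor = page_table.get(index)           # level 2: page table
    if descriptor is not None and descriptor.get("indirect"):
        descriptor = descriptor["target"]        # follow exactly one extra level
    return descriptor

# Two aliased virtual pages share one final descriptor, so the
# referenced and modified bits live in a single place.
shared = {"frame": 0x00ABC, "referenced": False, "modified": False}
tables = {
    1: {2: {"indirect": True, "target": shared}},
    3: {4: {"indirect": True, "target": shared}},
}
alias_a = (1 << 22) | (2 << 12) | 0x010
alias_b = (3 << 22) | (4 << 12) | 0x7FC
```

Because both aliases resolve to the same descriptor object, setting `referenced` or `modified` through one alias is visible through the other, which is the bookkeeping simplification the indirection is meant to provide.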

Each TLB implements facilities for the operating system to 
determine whether a particular translation is currently in the 
TLB, for locating and invalidating individual entries in the 
TLB, and for invalidating all user or supervisor entries. These 
facilities support many aspects of virtual memory manage¬ 
ment, including TLB coherence protocols such as the Mach 
operating system's TLB “shoot-down” algorithm. 22 

Another important feature of the TLBs is the block address
translation facility. In addition to the
32 page entries already described, each
TLB has eight variable-size block address
translation entries. The block translation
entries perform address translation in a
similar fashion to the page entries but
are capable of mapping large blocks of
memory (512 Kbytes to 64 Mbytes).
These entries allow mapping of large
static areas of system code or data structures
and large entities such as frame
buffers, without using an excessive number
of TLB page entries.

Caches 

The 88110 has a Harvard-style inter¬ 
nal architecture—that is, it has separate, 
independent instruction and data paths. 
An on-chip instruction cache feeds the 
instruction unit, and an on-chip data 
cache feeds the load/store execution unit. 
Cache misses are multiplexed together 
and are serviced from a common exter¬ 
nal bus interface. 

The instruction and data caches are 
both physically addressed. Physical 
caches have the advantage over logical 
caches in that synonyms do not occur. 
As a result, special precautions to disam¬ 
biguate logical addresses across differ¬ 
ent process contexts are not necessary. 
No extra hardware is required for asso¬ 
ciating a logical address with a specific 
process, nor is it necessary to incur the 
overhead of flushing the caches on a 
context switch. Physically addressed 
caches also simplify maintenance of 
cache coherency in multiprocessor 
systems. 

In the 88110 implementation, the 
caches are logically indexed and physi¬ 
cally tagged, as illustrated in Figure 19. 
The cache arrays are directly indexed 
with the 12 untranslated page offset bits 
from the least significant portion of the 
logical address. Each cache line is tagged 
with the high-order 20 bits of the fully 
translated physical address. This arrange¬ 
ment allows selection of the cache set 
and retrieval of the cache tags and data 
in parallel with the translation of the logi¬ 
cal address to a physical address in the 
TLB. After the physical address becomes
available, the machine compares it against the two tags (one
associated with each line of the selected cache set) to determine
a hit or miss and make the final line selection.

Figure 19. Organization of instruction and data caches. (Diagram labels: user/supervisor and instruction/data selects; logical (effective) address fields of 20, 7, 2, and 3 bits; data (instructions) output.)

Figure 20. Cache miss. (Diagram labels: 32-bit address bus, 64-bit data bus.)

Instruction cache. The 8-Kbyte, two-way, set-associative 
instruction cache is organized into 128 sets, with two lines 
for each set, and 32 bytes (eight instructions) in each line. 
The two-way set associativity gives the cache substantially 
better hit rates than direct-mapped caches at the 8-Kbyte size, 
and, due to the implementation techniques used, does not 
adversely affect clock cycle time. The cache has a one-clock- 
cycle access time and can provide a pair of instructions (64 
bits) to the instruction dispatch unit on each cycle, regardless 
of whether the access is aligned to an odd or an even word 
address. The only time the cache fails to deliver two instruc¬ 
tions is when the instruction pair straddles a cache-line bound¬ 
ary. In practice this turns out to have a very minor performance 
impact. 
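
The cache geometry above fixes how an address is split. The sketch below shows one way to model it; the function name `split` is hypothetical. The set index and byte offset come from the untranslated low 12 bits of the logical address, while the tag comes from the translated physical address.

```python
LINE_BYTES = 32          # eight 4-byte instructions per line
SETS = 128
WAYS = 2

OFFSET_BITS = 5          # log2(LINE_BYTES)
INDEX_BITS = 7           # log2(SETS)
PAGE_BITS = OFFSET_BITS + INDEX_BITS   # 12: index and offset fit in the page offset

def split(logical_address, physical_address):
    """Return (tag, set_index, byte_offset) for one cache access."""
    byte_offset = logical_address & (LINE_BYTES - 1)
    set_index = (logical_address >> OFFSET_BITS) & (SETS - 1)
    tag = physical_address >> PAGE_BITS            # high-order 20 bits
    return tag, set_index, byte_offset

def is_hit(set_tags, tag):
    """Compare the physical tag against the two tags stored in the selected set."""
    return tag in set_tags
```

Since 128 sets times 32 bytes is exactly one 4-Kbyte page, the index never depends on translated bits, which is why the set can be read while the TLB is still working.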

We designed the cache to minimize latency on a miss. 
Figure 20 shows the hardware involved in a cache miss. On 
a miss, the processor initiates an eight-word burst transaction 
on the bus to fill the cache line. The burst can transfer two 
instructions from the bus into the cache on each clock cycle. 


The burst begins with the missed in¬ 
struction pair, continues transferring to 
the end of the cache line, and then 
wraps around to fill the beginning of 
the cache line (if necessary). As soon 
as the cache receives the missed instruc¬ 
tion from the bus, it forwards the in¬ 
struction directly to the instruction unit 
so that execution can resume immedi¬ 
ately. As the cache receives subsequent 
instructions from the bus, it also streams 
them directly into the instruction unit 
so that execution doesn’t stall while the 
cache line is being brought in from 
memory. 
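
The wraparound order of the burst can be expressed compactly. The sketch below assumes the figures given in the text (a 32-byte line filled in 8-byte beats); the function name `burst_order` is hypothetical.

```python
def burst_order(miss_address, line_bytes=32, beat_bytes=8):
    """List the addresses of the beats in the order the burst delivers them."""
    base = miss_address & ~(line_bytes - 1)          # start of the cache line
    beats = line_bytes // beat_bytes                 # four beats per line
    first = (miss_address - base) // beat_bytes      # beat holding the missed pair
    return [base + ((first + i) % beats) * beat_bytes for i in range(beats)]

# A miss in the third instruction pair of the line at 0x1000:
order = [hex(a) for a in burst_order(0x1010)]
```

Here the burst starts at 0x1010, runs to the end of the line at 0x1018, and then wraps to 0x1000 and 0x1008, so the instruction that caused the miss arrives on the first beat.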

As shown in Figure 21, the cache is 
arranged in eight 1-Kbyte blocks; each 
block is 64-bits wide by 128 rows. Each 
row contains one 32-bit word from each 
of the two cache lines. Four of the cache 
blocks hold the even words of a pair; 
the other four hold the odd words. 
When access is made to an evenly 
aligned instruction pair, the least sig¬ 
nificant word returns on the even bus 
and the most significant word returns 
on the odd bus. When access is made 
to an oddly aligned pair, the least sig¬ 
nificant word returns on the odd bus 
and the most significant on the even 
bus. The instruction sequencer swaps 
the two words before using them. 

It is the responsibility of software to maintain instruction 
cache coherency. Thus, when the virtual memory system swaps 
in a new page, the instruction cache entries may no longer be 
valid because a particular logical address may now map to a 
different physical address. The 88110 provides a fast (approxi¬ 
mately five clock cycles) cache-invalidate and cache-line- 
invalidate feature that enables supervisor routines to eliminate 
stale data from the cache. Instruction cache coherence with 
other caches in a multiprocessor system is not normally a prob¬ 
lem because processors do not frequently write into instruc¬ 
tion space. In fact, the 88110 hardware does not directly support 
self-modifying code—that is, a program that writes into the 
currently executing instruction stream. However, the operat¬ 
ing system does need to implement loaders, computed pro¬ 
grams, copying garbage collectors, and other such programs. 

Data cache. The data cache’s organization resembles that of
the instruction cache. It is 8 Kbytes in size, two-way set associative,
and has eight words in each line. It has a single-clock-cycle
access time and can provide 64 bits each clock cycle to the
load/store unit. The normal cache-write policy is “store-in” (write-back
with write-allocate). As with the instruction cache,
burst line fills begin on the missed word,
and data is forwarded and streamed off
the bus, through the load/store unit, and
directly into the register files to minimize
miss latency.

We selected store-in policy because 
bus traffic is less for store-in caches than 
for store-through caches. 23 In store- 
through caches, the number of main 
memory references is never less than 
the store frequency regardless of cache 
size. 24 Considering that 20-30 percent 
of memory references are typically 
stores, this can be a problem. Store-in 
policy helps in multiprocessor systems, 
where bus utilization must be kept low 
for good system performance. 

On a load or store that hits the cache, 
the memory reference accesses the 
cache directly. On a miss, cache con¬ 
trol logic selects a line in the cache for 
replacement, using a pseudorandom se¬ 
lection algorithm that gives priority to 
invalid lines. If the line selected for re¬ 
placement has not been modified since being brought into the 
cache, the hardware simply brings in a new cache line from 
memory to overwrite it. If the selected line has been modified, 
it is first copied back to memory before the new line comes in 
to replace it. A store that hits the cache on a clean line writes
directly into the cache and also broadcasts a message to other 
processors on the bus to invalidate any copies of this cache 
line they may have in their local caches. 

The hardware automatically maintains data cache coherence.
The data cache employs the four-state MESI cache coherency
protocol illustrated in Figure 22. A
write-invalidate procedure guarantees that only one processor
on the bus has a modified copy of any given cache line at
any one time.

The coherency protocol is enforced by bus snooping, 
whereby each processor watches (snoops) all bus transac¬ 
tions to track the proper state for each cache line. 25 For ex¬ 
ample, if a bus transaction occurs for a cache line that a 
processor happens to have in the modified state, it forces the 
originator of the transaction off the bus, copies the modified 
line back to memory, changes the state of its line to shared- 
unmodified, and then allows the original bus transaction to 
be retried. The cache maintains a separate set of address and 
state tags for snooping, so that bus snooping does not inter¬ 
fere with the processor’s access to its local cache. 
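
The state changes described above can be summarized in a small model. This sketch is not a transcription of Figure 22: the function names are hypothetical, and it follows the textbook MESI convention that only a shared line needs an invalidate broadcast on a store hit, which the text does not spell out for exclusive lines.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

def store_hit(state):
    """Local store that hits: return (new_state, broadcast_invalidate)."""
    if state is State.INVALID:
        raise ValueError("an invalid line cannot hit")
    # Assumption: only a shared line may have copies elsewhere to invalidate.
    return State.MODIFIED, state is State.SHARED

def snoop(state, is_write):
    """Another master's global transaction matches this line.

    Returns (new_state, copy_back_and_retry).
    """
    if state is State.MODIFIED:
        # Force the originator off the bus, copy the line back to memory,
        # keep a shared-unmodified copy, and let the transaction be retried.
        return State.SHARED, True
    if state is State.INVALID:
        return State.INVALID, False
    # Clean copy: drop it if the other master is writing, otherwise share it.
    return (State.INVALID if is_write else State.SHARED), False
```

A snooped write to a modified line therefore takes two passes: the first pass triggers the copy-back and leaves the line shared, and the retried transaction then invalidates the shared copy.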

Although hardware fully maintains data cache coherence 
from a multiprocessor point of view, the operating system must 
still flush stale data out of the cache when the virtual memory 
map is altered. The data cache can be invalidated quickly on a
line or entire-cache basis. The cache can be cleaned (copy-back
of dirty lines) or flushed (copy-back of dirty lines with
invalidation) on a line, page, or entire-cache basis. The operat¬ 
ing system activates invalidation and flushing operations by 
writing commands to cache control registers accessible only 
from supervisor mode. Invalidation operations are very fast, 
requiring approximately five clock cycles for either line or full- 
cache invalidations. Cleaning or flushing on a page or entire- 
cache basis requires one clock cycle for each cache set (each 
cache contains 128 sets) plus the memory transfer time needed 
to copy back any dirty lines. 

External bus interface 

The 88110 processor has a high instruction throughput and 
therefore generates a high rate of memory accesses. The on- 
chip caches provide relatively high hit rates and eliminate 
most off-chip memory accesses, but even so a substantial 
amount of external memory traffic can occur. To keep bus 
usage down to a point that a tightly coupled, dual-processor 
system is viable, we used a store-in (write-back) data cache 
policy to reduce bus traffic and also developed an efficient 
multiprocessor bus. 

The 320-Mbyte/s, synchronous, demultiplexed, pipelined
bus supports a retry cache coherency snooping protocol and
offers burst-mode and split-transaction transfers. A 64-bit data
path minimizes data transfer time on the bus; burst-mode
cache-line fills reduce transaction overhead; a split-transaction
protocol allows other masters to use the bus while one master
waits for memory; and address pipelining allows memory
access time to overlap data transfer time. These features, along
with the snoopy data cache, make simple, low-cost multiprocessor
systems practical.

Figure 22. MESI cache coherency protocol. (Transition labels: RH, read hit; RMS, read miss, shared; RME, read miss, exclusive; WH, write hit; WM, write miss.)

Bus arbitration, handshaking, and data transfers are all syn¬ 
chronous with the system clock and are referenced to a single 
clock edge. An internal, analog, phase-locked loop circuit 
minimizes skew between the internal clock cycle and exter¬ 
nal signals referenced to the system clock. This circuit greatly 
simplifies the problem of electrically interfacing to the chip at 
high speed. 

A centralized controller uses a simple bus-request/grant 
protocol to arbitrate bus ownership. The arbiter may “park” a 
processor on the bus to eliminate arbitration latency in the 
frequent cases that bus ownership does not change between 
successive transfers. 

The 32-bit address bus is separate from the 64-bit data bus
to support address pipelining. Address
pipelining allows the address phase of a 
bus transaction to run concurrently with 
a previous data transfer phase. In multi¬ 
processor configurations, address pipe¬ 
lining allows memory access times to 
overlap data transfer times, thereby in¬ 
creasing available bus bandwidth. 

In the past, most microprocessor buses 
used a tenured transaction protocol. A 
tenured protocol ties up the bus from the 
time a transaction starts until the entire 
memory cycle completes and data returns 
to the processor. In a DRAM system with 
relatively long access time, this protocol 
wastes considerable bus bandwidth. The 
88110, on the other hand, uses a split- 
transaction bus protocol, which allows a 
bus transaction to be split into distinct 
address and data phases that are con¬ 
trolled independently. 

For example, a processor can send an 
address request to the memory system 
and then permit other processors to use 
the bus while it waits for a response. This 
protocol uses the bus more efficiently, 
consuming bandwidth only during the 
time addresses or data actually transfer. 
Address pipelining and split transactions 
permit the 88110 to more closely ap¬ 
proach the theoretical bus bandwidth limit 
than microprocessors that use tenured bus 
protocols. 

All burst-mode transactions transfer an 
entire cache line. A burst transfer uses four 
data beats; each beat transfers 8 bytes of 
data. The system controls the length of time 
of each data beat, which can be as short as one clock cycle. 

The diagram in Figure 23 shows a possible sequence of bus
transactions in a multiprocessor system (for clarity some control
signals have been omitted). On the first clock cycle shown,
a processor (CPU A) is parked on the bus (BG A preasserted)
and requests a data cache line fill by asserting Transfer Start
(TS) and driving the address bus. Two clock cycles later, the
memory acknowledges receipt of the address (AACK), and
CPU A terminates its address phase. Meanwhile, a second processor
(CPU B) has asserted Bus Request (BR B) for an instruction
cache line fill and has been scheduled next onto the
bus by the arbiter’s assertion of Bus Grant (BG B) to it. As
soon as CPU A relinquishes the address bus, CPU B starts its
cycle. In this example, CPU B’s request gets serviced immediately
by memory, and the arbiter allows it to read data by
granting it access to the data bus (DBG B asserted). The data
transfer here is shown occurring at full
bus speed; however, the memory system
could use Transfer Acknowledge
(TA) to pace the transfer by inserting
wait states on each data beat. As soon
as the transfer to CPU B finishes, the
arbiter grants the data bus to CPU A (DBG
A) for its data transfer.

Figure 23. Split bus transaction. (Timing diagram over clock cycles 1 through 14; signals shown: Clock, A31-0, BG A, BG B, TS, AACK, DBG A, DBG B, D63-D0, TA; phases labeled CPU A address, CPU B address and data, CPU A data.)

Figure 24. Retry bus-snooping protocol.

The bus-snooping mechanism, illus¬ 
trated in Figure 24, enforces cache co¬ 
herency among all processors on the 
bus. When a processor (in this ex¬ 
ample, CPU A) that is currently the bus 
master puts a global (GBL) address out 
on the bus, all other processors on the 
bus snoop the address. If one of these 
processors (CPU B in this case) has a 
modified copy of the data being re¬ 
quested in its local cache, it signals that 
fact to the requesting processor (CPU 
A) via a snoop status signal (SSTAT B). 

Upon seeing the snoop hit signal, CPU 
A aborts its bus transaction and relin¬ 
quishes control. The snooping proces¬ 
sor (CPU B) then takes control of the 
bus and copies its modified line back 
out to memory. When this transaction 
is complete, CPU A retries the original 
transaction, which now completes nor¬ 
mally since all caches are consistent 
with memory. The control signals are 
flexible enough to support more so¬ 
phisticated protocols such as interven¬ 
tion with direct cache-to-cache transfer 
(snarfing). 6 

System features 

The 88110 includes several features 
designed to improve system debug¬ 
ging, reliability, and testability. 

One of the few drawbacks of on- 
chip caches is that they filter external 
memory references, which limits vis¬ 
ibility and reduces the utility of in-cir¬ 
cuit emulators for software debugging. One software debug 
issue, for example, is detecting the source of corruption of a 
particular program variable or data structure. The 88110 ad¬ 
dresses this problem by providing two data breakpoint regis¬ 
ters that allow a program to trap to a software debugger on 
an access to, or modification of, a specified logical address or 
range of addresses (byte, half, word, double, quad, ..., page). 
The 88110 also has a facility that allows a debugger to single- 
step a program one instruction at a time. 


Two features improve the 88110’s applicability in high- 
reliability systems: data bus parity and deterministic lockstep 
operation. On all external data bus write operations, the pro¬ 
cessor generates an odd parity bit for each byte transferred 
on the bus. On data bus read operations, the processor checks 
the parity bits and generates an interrupt if any byte transfers
in error. For redundant systems, lockstep operation makes it
possible to shadow one 88110 with another; once synchronized,
two 88110s will stay in lockstep so long as they are
presented with the same inputs at the same time.

Figure 25. Photograph of 88110 die.
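
The per-byte odd parity used on the data bus can be illustrated with a short sketch. The function names are hypothetical; the code models only the parity arithmetic, not the bus signals or the interrupt.

```python
def odd_parity_bit(byte):
    """Return the bit that makes the count of 1s (data plus parity) odd."""
    ones = bin(byte & 0xFF).count("1")
    return 0 if ones % 2 == 1 else 1

def generate(data):
    """Parity bits for a write: one bit per byte transferred."""
    return [odd_parity_bit(b) for b in data]

def bytes_in_error(data, parity):
    """Check a read: return the positions of bytes whose parity does not match."""
    return [i for i, (b, p) in enumerate(zip(data, parity)) if odd_parity_bit(b) != p]

# One 8-byte data beat, then the same beat with a single bit flipped in byte 2.
beat = bytes([0x00, 0x01, 0x03, 0xFF, 0x80, 0x7F, 0x55, 0xAA])
parity = generate(beat)
corrupted = bytes([beat[0], beat[1], beat[2] ^ 0x10]) + beat[3:]
```

A single flipped bit changes the parity of its byte and is caught; two flipped bits in the same byte cancel out and go undetected, which is the usual limitation of a single parity bit.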

We improved the in-system testability of the 88110 by providing
JTAG/IEEE 1149.1 boundary scan logic on all relevant
I/O pins.

Silicon 

We designed the 88110 in a triple-level metal, double-level
polysilicon CMOS process. We used 1-µm design rules
with transistor channel lengths reduced to an effective length
of less than 0.8 µm. The die is easily shrinkable to 0.8-µm
(0.65-µm effective channel length) technology without design
modifications. The complete design required fewer than
1.3 million transistors and fits on a die 15 mm on a side
(Figure 25). The cache SRAM cells are a four-transistor, NMOS,
polyload, bit-cell design, which uses a P-well process for
high immunity to soft errors.

Initially, we plan to provide the 88110 in a through-hole, 
ceramic pin grid array package. The 299-pin, 20 x 20, cavity- 
down package measures approximately 2 inches on a side, 
with 100-mil pin pitch. 

Performance 

Official benchmark data from real systems and production
compilers is not available at the time of this writing. However,
a good-quality prototype optimizing compiler (the Motorola
88110 Alpha compiler) is available, as well as a
clock-for-clock instruction simulator that accurately models
all processor pipeline, primary cache, TLB, and memory system
effects. Results indicate a Dhrystone 2.1 performance
that translates to well over 100 VAX MIPS.

Table 1. Simulated SPEC ratios at 50 MHz.

Benchmark          Ratio
Gcc                 46.5
Espresso            48.1
LI                  57.0
Eqntott             52.9
Spice2g6            34.7
Doduc               41.4
Nasa7               67.9
Matrix300          357.8
Fpppp               64.4
Tomcatv             72.2

Geometric means
Integer             51.0
Float               73.9
Combined            63.7

We also used the instruction simulator to run the SPEC 
benchmark suite at 50 MHz with a 180-ns (9/1/1/1) DRAM
memory system. The results appear in Table 1. These bench¬ 
marks were compiled with the Motorola 88110 Alpha com¬ 
piler, except LI, which was compiled with the Diab 88110 
Compiler Version 2.37. The Nasa 7 and Matrix 300 bench¬ 
marks were preprocessed by the Kuck and Associates pre¬ 
processor. In a recent publication, Mike Phillip reports more 
completely on the 88110 compilers and performance. 26 
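
The geometric means in Table 1 can be recomputed from the individual ratios. The sketch below does so; treating Gcc, Espresso, LI, and Eqntott as the integer group and the remaining six benchmarks as the floating-point group is an assumption based on the usual SPEC grouping, and the variable names are illustrative.

```python
import math

# Ratios copied from Table 1; the integer/float grouping is assumed.
INTEGER = {"Gcc": 46.5, "Espresso": 48.1, "LI": 57.0, "Eqntott": 52.9}
FLOAT = {"Spice2g6": 34.7, "Doduc": 41.4, "Nasa7": 67.9,
         "Matrix300": 357.8, "Fpppp": 64.4, "Tomcatv": 72.2}

def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

integer_mean = geometric_mean(INTEGER.values())
float_mean = geometric_mean(FLOAT.values())
combined_mean = geometric_mean(list(INTEGER.values()) + list(FLOAT.values()))
```

Under that grouping the three results agree with the table's 51.0, 73.9, and 63.7 to within rounding. The geometric mean also explains why the single large Matrix300 ratio lifts the floating-point figure far less than an arithmetic average would.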

The instruction simulator does not yet accurately model all 
effects of external secondary-cache misses, so we haven’t 
reported results with a second-level cache here. However, 
simulations with an infinite secondary cache show a com¬ 
bined Specmark above 80. 

A significant characteristic of the 88110 is that it makes 
parallel instruction execution fairly easy to achieve in prac¬ 
tice. Relatively simple compilers can produce effective code 
schedules for the 88110; in fact, the processor realizes sub¬ 
stantial parallelism even on code originally generated for the 
88100 single-issue CPU. The efficiency of superscalar issue 
ranges from 20 percent to over 50 percent, depending on the 
benchmark and memory configuration. Currently, over the 
SPEC benchmark suite, we find that two instructions issue on 
roughly half (ranging from about 35-70 percent) of the clock 
cycles on which an instruction executes at all. Of course, we 
expect these results to continually improve with advances in 
our compilers. 

Figure 26. Floating-point matrix multiply.

The increasing use of high-level languages makes good
performance on general compiled code essential. But many
programs spend a great deal of time in a few critical routines. 
System response time can dramatically improve by tuning a 
few of these hot spots. DSP algorithms for voice processing, 
graphics library routines, video processing, and interactive 
user interface routines are prime candidates for this type of 
tuning. As an example, the double-precision floating-point 
matrix multiply routine often used in graphics viewpoint trans¬ 
formations is illustrated in Figure 26. On this code the 88110 
can issue two instructions on nearly every clock cycle and 
can sustain 97.5 MIPS and 68 double-precision Mflops (at 50 
MHz), even if the point vectors being transformed are not in 
the cache and the processor is operating into DRAM. 
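
The code of Figure 26 is not reproduced here. As a rough illustration of the kind of routine meant, the sketch below applies a 4 x 4 transformation matrix to a list of four-element point vectors in double precision (Python floats). It is a generic example, not the 88110 code, and the name `transform` is hypothetical.

```python
def transform(matrix, points):
    """Multiply each 4-element point vector by a 4 x 4 matrix."""
    result = []
    for p in points:
        out = []
        for row in matrix:
            # One dot product per output coordinate: 4 multiplies, 3 adds.
            acc = row[0] * p[0]
            for c in range(1, 4):
                acc += row[c] * p[c]
            out.append(acc)
        result.append(out)
    return result

# A translation by (2, 3, 4) in homogeneous coordinates.
translate_234 = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 3.0],
    [0.0, 0.0, 1.0, 4.0],
    [0.0, 0.0, 0.0, 1.0],
]
moved = transform(translate_234, [[1.0, 1.0, 1.0, 1.0]])
```

The inner loop is a stream of independent multiplies and adds interleaved with loads of the next point, which suits dual issue. For scale, 97.5 MIPS at 50 MHz corresponds to 1.95 instructions per clock cycle.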

Second-level cache 

Although the hit rate of the 88110’s internal caches is quite 
high, long DRAM latency and high bus utilization can still 
limit performance. For ultimate performance, or for system 
designs calling for more than two tightly coupled processors, 
we must further reduce memory access time and bus use. 

One obvious approach is to use a secondary cache local to
each processor. 27 Motorola designers developed a fully integrated
second-level cache, consisting of the 88410 cache controller
and an array of 62110 cache SRAMs. We implemented
the secondary-cache function as a separate chip, rather than
putting the logic on the 88110 itself, so that low-end systems
don’t have to pay for transistors they don’t use.

The 88410 sits directly in the 88110 address bus path, as
shown in Figure 27, and provides all secondary-cache control
functions and tags for 256 Kbytes to 1 Mbyte of cache.
Cascaded 88410s can support larger cache sizes. The cache
tags included on the 88410 allow all hit, miss, and data-steering
decisions to be made quickly without accessing off-chip
SRAMs. This approach also reduces pin count and the number
of SRAM packages required, thereby minimizing the system
cost.

Figure 27. Second-level cache.

The SRAM cache array sits directly in the 88110 data bus 
path and the 88410 controls all data transfers into, out of, and 
through the array. The 62110 cache SRAM device works es¬ 
pecially well in an 88110/410 secondary-cache system. The 
62110, based on a standard 32K x 9, 12-ns SRAM, has a dual¬ 
bus architecture that allows data to be fed directly from the 
system bus onto the 88110 data bus and simultaneously cap¬ 
tured in the internal SRAM array. We plan to offer the 62110 
commercially as a cache part for other systems as well. 

The secondary cache implemented by the 88410 is a di¬ 
rect-mapped cache with a store-in (write-back) write policy. 
Line length is configurable to either 32 or 64 bytes. Cache 
hits, using the 62110 SRAM, present a 3/1/1/1 memory cycle 
to the 88110. 

A four-state, MESI protocol enforced by bus snooping main¬ 
tains horizontal coherency between the 88410 cache and other 
caches on the system bus. The 88410 uses inclusion 28 to main¬ 
tain vertical coherency between the 88110’s internal cache 
and the secondary-cache array. 

The bus protocol and electrical interface used by the 88410 
are similar to those used by the 88110. As a result, one can 
design a system that can accept either an 88110 or an 88110/ 
88410/62110 module. The 88410 also has an option to allow 
the system bus to operate at half the speed of the 88110 bus. 
This feature relaxes system timing constraints and will even¬ 
tually allow systems to accommodate higher frequency 88110/ 
410 modules, using standard TTL electrical interfaces. 


In designing the 88110, our goal was to produce a
high-performance, general-purpose microprocessor at a cost
consistent with use in low-cost personal computers and workstations.
We accomplished this goal with an advanced superscalar
architecture and a high level of circuit integration
implemented in a fine-geometry, high-yield, semiconductor
fabrication process.
The Proposed SSBLT Standard
Doubles the VME64 Transfer Rate 


A revision to the IEEE 1014 VMEbus standard will offer a source synchronous block transfer 
protocol that doubles the transfer rate without changing the backplane or electrical interface. 
The faster rate in turn doubles the performance/cost ratio of the bus. 


Jack Regula 

Force Computers 


In 1991, the IEEE P1014R (Revision D) 
working group drafted a new transfer 
mechanism for the 64-bit VMEbus: 1 the 
source synchronous block transfer 
(SSBLT). The working group gave preliminary ap¬ 
proval to the SSBLT protocol as described here; it 
is thus on its way to becoming an IEEE standard. 

With SSBLT, the source of the data supplies 
the clock used to sample the data at the destina¬ 
tion. Consequently, the working group applied 
the term source synchronous block transfer to the 
protocol. SSBLT achieves higher performance by 
eliminating the protocol delays built into the origi¬ 
nal VMEbus specification. It is optimized by its 
source synchronous nature, which minimizes the 
skew between the data and the clock. 

SSBLT doubles the rate at which data transfers 
between masters and slaves. Operating over the 
64-bit VMEbus (as defined in the latest proposed 
draft, Rev. D), data transfers at 20M transfers per 
second times 8 bytes per transfer, for a burst trans¬ 
fer rate of 160 Mbytes/s. 

Significantly, this performance improvement 
results purely from protocol improvements. SSBLT 
allows transfers to make use of standard VMEbus 
backplanes and driver technology and permits 
systems employing SSBLT to be backward compatible 
with present IEEE 1014 VMEbus modules. 

Progress 

VMEbus performance has increased in several 
steps since it was introduced 10 years ago. From 
the original maximum transfer rate of 10 to 20 
Mbytes/s without block transfers, VMEbus through¬ 
put increased to a peak of 30-40 Mbytes/s when 
using 32-bit block transfers. Block transfers raise 
performance levels because, after an initial access 
latency, many slaves can supply data at a higher 
rate. VMEbus handshaking and protocol delays 
limit this rate to something less than 10M transfers 
per second. 

Multiplexed block transfers (MBLTs) were pro¬ 
posed about three years ago and began reaching 
production status during 1991. MBLTs double 
block transfer performance by doubling the data 
path width. But, since they use the same proto¬ 
col and timing rules as block transfers, MBLTs 
are also limited to less than 10M transfers/s. 

MBLTs employ address/data multiplexing to 
double the data path width and, optionally, the 
address width. During the first cycle of an MBLT, 
which is conveniently called the address phase, 
no data transfers over the bus. In addition, another 
32 bits of address are multiplexed onto the 
data lines whenever the A64 mode (64 bits of address) is in 
use. After the first DTACK signal assertion (DTACK*), both the 
address and data lines can be used for data. From this point 
on, the timing for MBLTs is the same as for 32-bit block trans¬ 
fers. Therefore, performance doubles with MBLTs. 

The 64-bit address capability added to VMEbus with MBLTs 
also significantly extended its useful life. And, to address the 
increased use of bus bridges in future systems, Revision D 
for SSBLTs adds a cycle retry function intended to allow deadlocks 
to be broken. All these enhancements are compatible 
with, or are variations of, the original asynchronous, four-edge, 
strobe-acknowledge VMEbus handshake. 




The SSBLT mechanism goes beyond that of the MBLT by 
eliminating the strobe acknowledge handshake that limits 
the performance of the asynchronous protocol. Like the MBLT, 
SSBLT multiplexes address and data lines to form a 64-bit 
data path. The address phase is identical to that of the MBLT 
and can include either 32 or 64 bits of address. But in the 
data transfer portion, data is clocked from source to destina¬ 
tion without cycle-by-cycle handshaking at a rate of up to 20 
MHz or 160 Mbytes/s. 

SSBLTs contain unique address modifier codes: 07 indi¬ 
cates an A32 SSBLT, and 06 indicates an A64 SSBLT. Boards 
not capable of performing SSBLTs don’t respond to these 
address modifiers nor assert bus error signal BERR*. Thus the 
master can repeat the access with another transfer method 
such as a standard block transfer or an MBLT. This level of 
interoperability is assured by requiring an SSBLT board to 
support all earlier transfer methods. 
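
The fallback behavior described above can be sketched as a simple preference search. This is only an illustration: the `Transfer` enum, the `choose_transfer` function, and the idea of modelling a slave as a set of supported methods are assumptions made for the example, not part of the proposed standard. Only the two SSBLT address modifier values come from the text.

```python
from enum import Enum


class Transfer(Enum):
    SSBLT = "source synchronous block transfer"
    MBLT = "multiplexed block transfer"
    BLT = "standard block transfer"


# Address modifier codes the article gives for SSBLT cycles.
SSBLT_AM = {"A32": 0x07, "A64": 0x06}

# Hypothetical preference order: fastest method first.
PREFERENCE = (Transfer.SSBLT, Transfer.MBLT, Transfer.BLT)


def choose_transfer(slave_supports, preference=PREFERENCE):
    """Return the first method in `preference` that the slave accepts.

    `slave_supports` is a hypothetical set standing in for the slave's
    behavior on the bus. A real master learns this only by attempting
    the cycle and repeating it with another method if the attempt fails.
    """
    for method in preference:
        if method in slave_supports:
            return method
    raise RuntimeError("slave accepted none of the attempted methods")


legacy_board = {Transfer.MBLT, Transfer.BLT}
ssblt_board = {Transfer.SSBLT, Transfer.MBLT, Transfer.BLT}
print(choose_transfer(legacy_board).name)
print(choose_transfer(ssblt_board).name)
```

Because an SSBLT board must support all earlier transfer methods, a master written this way always finds a usable method when talking to one.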

Transfer and throughput rates 

Because the cycle-by-cycle handshake has been eliminated, 
boards can relatively easily transfer data at the peak transfer 
rate. Contrast this with VMEbus asynchronous handshaking, 
which is hard pressed to approach 10 MHz and is slowed 
down by backplane and driver propagation delays. The ini¬ 
tial access latency amortized over the entire burst primarily 
determines data throughput for an SSBLT. 


I’ve estimated that, with back-to-back transfers of 64 bytes, 
the sustained transfer rate using SSBLT is 128 Mbytes/s for 
writes and 100 Mbytes/s for reads. Increasing the block size 
to 2 Kbytes boosts the estimated sustained rate to 159 and 
157 Mbytes/s for writes and reads. 

The sustained-rate calculations assume 100 ns for the ad¬ 
dress phase on a write transfer and 240 ns for reads, includ¬ 
ing initial access latency (typical of high-performance VMEbus 
interfaces). Because 8 bytes transfer every cycle, each 64- 
byte block requires eight transfers. At the SSBLT maximum 
rate, transfers execute every 50 ns. Therefore, the calcula¬ 
tions for back-to-back 64-byte blocks are 

Write transfers 

100 ns + 8 transfers × 50 ns = 500 ns/block 
64 bytes/500 ns = 128 Mbytes/s 

Read transfers 

240 ns + 8 transfers × 50 ns = 640 ns/block 
64 bytes/640 ns = 100 Mbytes/s 

When the block size increases to 2 Kbytes, requiring 256 
transfers to complete, the overhead of the address phase is 
amortized over a larger data transfer period. Thus, back-to- 
back transfers of the larger blocks yield 

Write transfers 

100 ns + 256 transfers × 50 ns = 12,900 ns/block 
2 Kbytes/12,900 ns = 159 Mbytes/s 

Read transfers 

240 ns + 256 transfers × 50 ns = 13,040 ns/block 
2 Kbytes/13,040 ns = 157 Mbytes/s 
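
The four calculations above follow one formula, which the short sketch below reproduces. The function name and parameter names are illustrative; the 100-ns and 240-ns address phases, the 50-ns transfer period, and the 8 bytes per transfer are the figures assumed in the text.

```python
def sustained_rate_mbytes_per_s(block_bytes, address_phase_ns,
                                bytes_per_transfer=8, transfer_period_ns=50):
    """Sustained rate for back-to-back blocks, in Mbytes/s (10**6 bytes/s)."""
    transfers = block_bytes // bytes_per_transfer
    block_time_ns = address_phase_ns + transfers * transfer_period_ns
    # bytes per nanosecond times 1000 gives Mbytes per second
    return block_bytes / block_time_ns * 1000.0


WRITE_ADDRESS_PHASE_NS = 100  # write address phase assumed in the text
READ_ADDRESS_PHASE_NS = 240   # read address phase including access latency

for block_bytes in (64, 2048):
    write_rate = sustained_rate_mbytes_per_s(block_bytes, WRITE_ADDRESS_PHASE_NS)
    read_rate = sustained_rate_mbytes_per_s(block_bytes, READ_ADDRESS_PHASE_NS)
    print(f"{block_bytes:5d}-byte blocks: "
          f"write {write_rate:.0f} Mbytes/s, read {read_rate:.0f} Mbytes/s")
```

Changing the block size or the address-phase time shows how quickly the fixed overhead is amortized as bursts get longer.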

The write latency in these calculations assumes that the 
packet can be received without delay at the beginning of the 
transfer’s data cycle. The calculations also assume that, for 
reads of small blocks, a FIFO queue buffers the transfers and 
is partially loaded before the start of a transfer. We estimate 
this step to require 240 ns. For large block transfers, the cal¬ 
culations assume the circuitry of the boards involved is fast 
enough to handle the transfer in either direction without 
throttling. 

Eliminating protocol delays 

The key to the SSBLT’s ability to double transfer rates is its 
elimination of several protocol delays included in the origi¬ 
nal VMEbus standard. These delays simplified implementa¬ 
tions by allowing architecturally simple interfaces to be 
implemented with logic that is both agonizingly slow and 
extremely modest in complexity by today’s standards. To¬ 
day, architectural elegance is affordable, as is subnanosecond 
logic. 

Using the high-speed, high-density ASIC technologies available 
now, single-chip interfaces (including FIFO buffers and 
DMA controllers employing a 20-MHz SSBLT protocol) are 
well within the state of the art. Bus interface ASICs need only 
a few hundred additional gates to add SSBLT capability to a 
design that already includes MBLTs. 

Figure 1. VMEbus protocol delays. (Traces shown: address/data at master, address/data at slave.) 

Protocol design for TTL backplanes is complicated by the 
considerations of incident wave switching. Incident wave 
switching 2 in a transmission line environment refers to a 
driver’s ability to drive its output voltage across the switching 
threshold of receivers placed along the line as the voltage 
wavefront first propagates down the transmission line. A 
driver’s incident wave output voltage is reduced by voltage 
division of its output impedance with the transmission line’s 
impedance. When the driver isn’t strong enough for incident 
wave switching, the switching threshold isn’t crossed until a 
reinforcing reflection arrives from the far end of the transmission 
line. The resulting waveform then takes on a stairstep 
appearance with one step per reflection. 
In VMEbuses, a single step often appears 
near the threshold region. 

TTL backplanes generally do not pro¬ 
vide incident wave switching unless they 
are only lightly loaded. Protocol design¬ 
ers must take into account the possibil¬ 
ity that certain signals, such as data 
strobes, might be received with incident 
wave switching, while transitions of the 
data itself might not be seen until a re¬ 
flection arrives from the far end of the 
backplane. The original VMEbus stan¬ 
dard provided for this situation by re¬ 
quiring the master to provide a 35-ns 
setup time while guaranteeing the slave 
only 10 ns of setup. The difference is two backplane delays 
totaling 15 ns plus an additional allowance of 10 ns for skew 
in the bus drivers and receivers. The two backplane delays 
allow time for the reinforcing reflection to arrive from the far 
end(s) of the backplane. The SSBLT protocol’s data capture 
delay parameter permits the same effects. At 37.5 ns it is 
actually slightly more conservative than the original VMEbus 
protocol! Figure 1 illustrates the VMEbus protocol delays. 

VMEbus has two additional protocol delays. The slave may 
not assert DTACK* until at least 30 ns has elapsed since asser¬ 
tion of DS[1..0]*. Although not shown in Figure 1, the master 
cannot capture read data from the bus until 25 ns after the 
assertion of DTACK* because of possible skew between data 
and DTACK*. These protocol delays mean that even with infinite-speed 
logic and zero-delay backplanes, a compliant 
VMEbus data transfer cycle takes a minimum of 70 ns for writes 
and 95 ns for reads. When practical logic, driver, receiver, and 
backplane delays are added to these protocol minimums, 
VMEbus users find it extremely difficult to achieve burst transfer 
rates of greater than 64 Mbytes/s, even with the MBLT 
method. The significance of the SSBLT method is that it makes 
it relatively easy to achieve burst rates of 160 Mbytes/s and to 
sustain throughputs of over 100 Mbytes/s. 

Figure 2. VME64 source-synchronous block transfer. If AS* rises before data capture time, data does not transfer and the 
cycle ends. The slave can sample AS* at what would have been the data capture time and verify the burst end. DS1* stays 
asserted throughout the burst to keep the bus timer enabled. 

Figure 3. Asynchronous state machine for SSBLT master reads and slave writes. SSBLTC indicates master_read and DTACK* 
or slave_write and DS0*. 
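
The 70-ns and 95-ns protocol minimums quoted for the asynchronous handshake put a hard ceiling on burst rates even before real logic and backplane delays are counted. The sketch below works out that ceiling for 4-byte block transfers and 8-byte MBLTs. It is a back-of-the-envelope illustration with assumed names, not a figure taken from the standard.

```python
WRITE_MIN_CYCLE_NS = 70  # protocol-only minimum for a write cycle
READ_MIN_CYCLE_NS = 95   # protocol-only minimum for a read cycle


def ceiling_mbytes_per_s(min_cycle_ns, bytes_per_transfer):
    """Best-case burst rate if every cycle took only the protocol minimum."""
    return bytes_per_transfer / min_cycle_ns * 1000.0


for name, width_bytes in (("32-bit block transfer", 4), ("MBLT", 8)):
    write_ceiling = ceiling_mbytes_per_s(WRITE_MIN_CYCLE_NS, width_bytes)
    read_ceiling = ceiling_mbytes_per_s(READ_MIN_CYCLE_NS, width_bytes)
    print(f"{name}: write <= {write_ceiling:.0f} Mbytes/s, "
          f"read <= {read_ceiling:.0f} Mbytes/s")
```

Even this idealized MBLT ceiling falls short of the 160-Mbytes/s SSBLT burst rate, and practical delays push achievable rates well below it.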

Figure 2 shows both read and write SSBLTs. These contain 
an address phase in which the slave asserts DTACK* as soon 
as it recognizes the address and address modifiers and is 
ready to transfer data. In the write cycle, the master then uses 
DS0* to clock the data to the slave at a 20-MHz rate (or slower, 
if desired). In the read transfer, an additional data strobe 
pulse provides buffer turnaround time. Then the slave, which 
is the source on a read cycle, uses DTACK* to clock the data 
to the master. 

Figure 3 shows a small asynchronous state machine that 
may be used by an SSBLT master to receive data on a read 
cycle or by an SSBLT slave to receive 
data on a write cycle. Note that in Figure 2 
each edge of DS0* or DTACK* 
transfers data. The SSBLT protocol 
specifies that at least 50 ns must elapse 
between successive edges. The data destination 
detects each edge, waits for the data 
capture delay, then latches the data from 
the VMEbus. Data is nominally in phase 
with the clock at the source (±5 ns). 

The data capture delay, which must be 
between 37.5 and 45 ns, allows for 
nonincident wave switching, 5 ns of 
skew at each of the source and destination, 
and settling times. 
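
The receive-side rule can be modelled in a few lines: every strobe edge schedules one latch event a fixed capture delay later. The sketch below is a behavioral illustration only. The function name, the list-of-timestamps input, and the 40-ns default delay are assumptions; the 50-ns minimum edge spacing and the 37.5-to-45-ns capture window come from the text.

```python
MIN_EDGE_SPACING_NS = 50.0
CAPTURE_DELAY_RANGE_NS = (37.5, 45.0)


def capture_times(edge_times_ns, capture_delay_ns=40.0):
    """Return the times at which the destination latches data.

    `edge_times_ns` lists the arrival times of successive strobe edges
    (DS0* on a write, DTACK* on a read) as seen by the destination.
    """
    low, high = CAPTURE_DELAY_RANGE_NS
    if not low <= capture_delay_ns <= high:
        raise ValueError("capture delay outside the 37.5-45 ns window")
    for earlier, later in zip(edge_times_ns, edge_times_ns[1:]):
        if later - earlier < MIN_EDGE_SPACING_NS:
            raise ValueError("strobe edges closer than 50 ns")
    # Each edge transfers one 64-bit word, latched after the capture delay.
    return [t + capture_delay_ns for t in edge_times_ns]


print(capture_times([0.0, 50.0, 100.0, 150.0]))
```

Because the capture delay is shorter than the minimum edge spacing, each word is latched before the next strobe edge arrives.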

Figure 4 illustrates SSBLT protocol 
delays. This protocol simply sets the instantaneous transfer 
rate to the maximum that can be supported (with a margin 
for safety) and provides the timing rules for data transfer. In 
contrast with a VMEbus asynchronous handshake, it does 
not include cycle-by-cycle handshaking delays. Such delays 
make performance depend upon the physical length of the 
backplane and the speed of the backplane drivers and inter¬ 
face logic. SSBLT masters and slaves need only keep their 
skew and capture delay errors within budget to be able to 
use the maximum transfer rate. 

Rescinding DTACK* driver 

A rescinding three-state driver is a circuit that is actively 
driven high and then tristated (changes voltage to high, low, 
and off states). When used for DTACK*, a rescinding driver 
speeds up asynchronous block transfers. In a heavily loaded 
VMEbus backplane, the time constant of the terminators and 
the distributed capacitance of the bus increases the propagation 
delay for the rising edge of DTACK*. The resulting delay 
is greater than the VMEbus 40-ns minimum strobe high-time 
specification. This delays the start of the next cycle. 

Figure 4. SSBLT protocol delays. (Traces shown: data at master, data at slave, SSBLT_SRC_CLK, SSBLT_DST_CLK.) 

The SSBLT revision to the VMEbus standard provides the 
timing rules for use of a rescinding DTACK* driver. If a slave is 
using a tristate driver for DTACK*, it can enable its driver upon 
selection. It may not drive DTACK* low until 30 ns after DS[1..0]* 
assertion and must drive DTACK* high within 25 ns of all 
strobes high (AS*, DS0*, and DS1* = 1). DTACK* must be tristated 
within 50 ns of all strobes becoming high, which is 20 ns before the 
next selected slave is permitted to drive DTACK* low. 




VMEbus protocol does not allow multiple slave transac¬ 
tions that might result in slaves with three-state DTACK* driv¬ 
ers attempting to drive DTACK* to opposite levels. The timing 
rules provide a period of time in which both the previous 
and newly selected slaves may drive DTACK* high; however, 
this is not a problem. The only possibility for compatibility 
problems due to use of a three-state driver for DTACK* exists 
when the slave is participating in a proprietary broadcast 
scheme. In such a case, the slave’s DTACK* driver can and 
should be controlled so as to emulate an open-collector output. 

The VMEbus standard already specifies a high-current, three-state 
driver for DS0*; SSBLT adds that requirement for DTACK* 
since DTACK* must function like DS0* for SSBLT read cycles. 
Standard 48-mA drivers support data lines, while 64-mA drivers 
support DS0*, DTACK*, and other control signals. 

Transfer length, burst termination 

The SSBLT transfer mechanism permits from one to 256 
transfers in one block, based upon the requirement that the 
burst ends at the first 2-Kbyte boundary. The burst can con¬ 
tinue only after another address phase, which appears on the 
VMEbus as a second SSBLT. This arrangement limits the size 
of the address counter required at the slave and means that 
boards that are not involved in a transfer don’t have to incre¬ 
ment their address counters during it. 
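
A master that needs to move more than one burst's worth of data has to cut the request at each 2-Kbyte boundary and issue a new address phase for every piece. The sketch below shows one way to do that split. The function name is hypothetical, and it assumes that addresses and lengths are multiples of the 8-byte transfer size, which the text does not state.

```python
BYTES_PER_TRANSFER = 8
BOUNDARY_BYTES = 2048  # a burst must end at the first 2-Kbyte boundary


def split_into_bursts(start_addr, length_bytes):
    """Return a list of (address, transfer_count) pairs, one per SSBLT."""
    if start_addr % BYTES_PER_TRANSFER or length_bytes % BYTES_PER_TRANSFER:
        raise ValueError("this sketch assumes 8-byte aligned requests")
    bursts = []
    addr, remaining = start_addr, length_bytes
    while remaining > 0:
        to_boundary = BOUNDARY_BYTES - (addr % BOUNDARY_BYTES)
        chunk = min(remaining, to_boundary)
        bursts.append((addr, chunk // BYTES_PER_TRANSFER))
        addr += chunk
        remaining -= chunk
    return bursts


print(split_into_bursts(0x0000, 4096))
print(split_into_bursts(0x07C0, 128))
```

No burst produced this way exceeds 256 transfers, so the slave's address counter never needs more than 8 bits of transfer count.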

The master terminates an SSBLT by driving AS* high. For 
both reads and writes, if AS* goes high before data 
capture time, data cannot transfer, and the cycle ends. By 
sampling AS* at what would have been the data capture time, 
the slave can determine that the burst has ended. If AS* is 
high, the cycle ends, and no data transfer is associated 
with the previous strobe edge. To keep the bus timer 
enabled and thus prevent spurious error indications, DS1* is 
asserted throughout a burst. 

Throttling 

The SSBLT provides interblock throttling as a packet-level 
mechanism corresponding to cycle-by-cycle handshaking. 
Ideally, an SSBLT slave asserts DTACK* during an address 
phase; it signals its ability to accept/provide a burst at the full 
transfer rate. Subsequently, it needs to throttle only infre¬ 
quently and momentarily. An intrablock throttling mecha¬ 
nism answers this need. 

Some applications, such as digital imaging systems, involve 
large block transfers. It can be necessary to suspend these 
momentarily to allow a competing transfer to take place on 
the local bus of the master or slave. This is an example of an 
appropriate use of intrablock throttling. 

To delay another block transfer, the destination (slave) can 
make use of two options. During the address phase, it can 
simply fail to respond until it is ready, or it can assert both 
RETRY* and BERR*. The source (master) then terminates the 
cycle before any data transfers, releases the bus, and waits 
before attempting another transfer. This is interblock throt¬ 
tling. Note that the RETRY* protocol specifies a bus release 
that usually results in other VMEbus traffic before the retry 
takes place. 

SSBLT also provides a method for intrablock throttling. 
This alternative activates when the destination’s input buffer 
is almost full, yet the burst is not over. Intrablock throttling 
allows the destination to suspend the data transfer until it 
can catch up. System designers should arrange that intrablock 
throttling is required only infrequently. 

To employ intrablock throttling during a write, the slave 
drives DTACK* to a high level. When the master detects 
DTACK* as high, it simply suspends the transfer until DTACK* 
becomes low again. Figure 5 shows intrablock throttling for 
the slowest case in which the master doesn’t suspend trans¬ 
fers until it has driven the third data and strobe edges after 
DTACK* deassertion. A faster responding source might also 
have paused with either D[63..0](4) or D[63..0](5) valid on 
the bus; destination devices must be able to deal with any of 
these possibilities. 

During a read, the master temporarily stops the slave from 
transmitting data by driving DS0* high. When the master drives 
DS0* low again, the slave continues the transfer. Figure 6 
shows the waveforms for intrablock throttling on a read. The 
slave's response to the deassertion of DS0* on a read is analogous 
to the master's response to DTACK* on a write. The 
same possible stopping points of one, two, or three transfers 
past strobe deassertion exist as in the write case. 

In intrablock throttling for both reads and writes, the source 
must freeze its data and clock outputs within 100 ns of detecting 
the throttle signal. Note, though, that the timing for 
the deassertion of DTACK* or DS0* is not fixed; no specific 
time reference drives either signal. Similarly, no specific time 
reference determines when the sender of the data must sample 
DTACK* or DS0*. The SSBLT specification only requires the 
data sender to halt data transfers within 100 ns of detecting 
either signal as high. 

Figure 5. Write intrablock throttling. Throttling protocol rule: Source must freeze its data and clock output within 100 ns 
of detecting DTACK* high. Since a 30-ns round-trip path delay may exist between source and destination, the destination 
should throttle when its queue is within 3 locations of the overflow point. 

Figure 6. Read intrablock throttling. 

In Figure 7 (on the next page) two unspecified timings 
relate to throttling and indicate the added timing consequences 
of backplane delay. The timing values in this figure also ap¬ 
ply to error handling (more on this later). 

After asserting the intrablock throttling signal, the destination 
must be able to accept as many as two additional transfers 
before the source stops transferring data. Because the 
round-trip backplane delay between the source and destination 
might be as great as 30 ns, the destination should throttle 
bursts when its queue is within three transfers of the overflow 
point. 
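
One plausible way to read these numbers is as a count of how many 50-ns transfer periods fit into the time the source has before it must stop. The sketch below makes that reading explicit. It is an interpretation for illustration, not the working group's own derivation, and the function name is hypothetical.

```python
import math


def transfers_after_throttle(freeze_limit_ns=100, round_trip_ns=0,
                             transfer_period_ns=50):
    """Upper bound on transfers that can still arrive after the throttle.

    The source may keep sending for up to `freeze_limit_ns` after it
    detects the throttle signal, and `round_trip_ns` covers the signal
    reaching the source plus the last data reaching the destination.
    """
    return math.ceil((freeze_limit_ns + round_trip_ns) / transfer_period_ns)


print(transfers_after_throttle())                  # no backplane delay
print(transfers_after_throttle(round_trip_ns=30))  # 30-ns round trip
```

Under this reading, a destination queue needs at least three free locations at the moment it asserts the throttle signal.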

Throttling can be misused as a slow form of cycle-by-cycle 
handshaking. VMEbus Rev. D will recommend that interface 
designs not only use throttling infrequently but also only for 
short periods. The spirit of the revision’s SSBLT protocol is 
that data be burst over the bus at 20 MHz with suspensions 
only for infrequent, exceptional conditions such as 
a DRAM refresh. 

Conservative timing 

The timing for SSBLT is based upon the same backplane 
and driver characteristics as the original VMEbus specifica¬ 
tion. This standard provided reliable data transfer with up to 
25 ns of skew between data and the data strobe or DTACK*. 
This skew time includes two 7.5-ns backplane propagation 
delays plus 10 ns of driver/receiver skew. The two backplane 
propagation delays appear in the potential skew because 
the control signal edges (which are driven with 64-mA 
drivers) might be seen with incident wave switching. 
The data edges will generally be detected only after their 
first reflection from the far end of the backplane. 

Figure 7. Error and throttling timing. The master must terminate or suspend 
transfers within 100 ns of detecting a low on BERR* or a high on DTACK*. The 
points at which the slave asserts BERR* or deasserts DTACK*, or at which the 
master samples them, are not specified. Note that RETRY* assertion is permitted 
only on the address phase. (Traces shown: DS0* at master/source, DS0* at 
slave/destination, DTACK* and BERR* at slave, DTACK* and BERR* at master, 
and the point at which the master samples DTACK* and BERR*.) 


Table 1. Transmission timings. 

Characteristic                                         Timing (ns) 

Data clock skew (two backplane propagation delays)        15.0 
Data settling time                                        10.0 
Driver/receiver skew                                      10.0 
Receiver setup time                                        2.5 
Minimum capture delay                                     37.5 
Time quantization error (half period of ASIC clock)        5.0 
Minimum hold time                                          2.5 
Minimum transmit period                                   45.0 


The worst-case minimums for source synchronous cap¬ 
ture delay and the overall transmission period take into ac¬ 
count the same skew, settling times, setup/hold times, plus 
a time quantization error allowing this delay to be generated 
synchronously. The transmission period determines the mini¬ 
mum time that can be allowed between data strobe edges. 
Table 1 lists these times, in nanoseconds. 

To provide additional margin, the SSBLT protocol adds 
5 ns to the minimum transmission period, for a specified 
transmission period of 50 ns. See Figure 4 again 
for the SSBLT protocol delays and 
stable data window. 
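
The timing budget in Table 1 can be checked by simple addition. The sketch below groups the rows the way their order suggests: the first four sum to the minimum capture delay, and adding the quantization error and hold time gives the minimum transmit period. That grouping is inferred from the table and from the fact that the sums match, and all names are illustrative.

```python
# Rows of Table 1 that make up the minimum capture delay (ns).
CAPTURE_DELAY_TERMS_NS = {
    "data/clock skew (two backplane propagation delays)": 15.0,
    "data settling time": 10.0,
    "driver/receiver skew": 10.0,
    "receiver setup time": 2.5,
}
TIME_QUANTIZATION_NS = 5.0  # half period of the ASIC clock
MIN_HOLD_NS = 2.5
SAFETY_MARGIN_NS = 5.0      # margin the protocol adds on top of the minimum
BYTES_PER_TRANSFER = 8

min_capture_delay_ns = sum(CAPTURE_DELAY_TERMS_NS.values())
min_transmit_period_ns = (min_capture_delay_ns
                          + TIME_QUANTIZATION_NS + MIN_HOLD_NS)
specified_period_ns = min_transmit_period_ns + SAFETY_MARGIN_NS

# 1000 / (period in ns) gives millions of transfers per second.
mega_transfers_per_s = 1000.0 / specified_period_ns
burst_mbytes_per_s = mega_transfers_per_s * BYTES_PER_TRANSFER

print(min_capture_delay_ns, min_transmit_period_ns, specified_period_ns)
print(mega_transfers_per_s, burst_mbytes_per_s)
```

The result ties the table back to the headline figures: a 50-ns period is 20M transfers per second, or 160 Mbytes/s over a 64-bit path.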

AM codes 

SSBLT employs two new address 
modifier codes that were previously 
undefined in the original VMEbus stan¬ 
dard. They are 0x06 for A64 SSBLT and 
0x07 for A32 SSBLT. 

Terminating a transfer 

Masters can terminate transfers when 
the required data has been sent or re¬ 
ceived. During a write, when the mas¬ 
ter has transmitted the required data, 
the master stops strobing the DS0* line 
and drives AS*, DS1*, and DS0* high. 
The slave responds by driving DTACK* 
high and thus terminating the transfer. 

In a read, when the master does not need more data, it 
terminates the burst by driving DS0*, DS1*, and AS* high. The 
slave then terminates the transfer within two strobe edges. 
Figure 2 (shown earlier) illustrates normal terminations for 
both the read and write cycles. The delays between master 
and slave result in a “dummy” data cycle being driven onto 
the bus by the slave after the master terminates the read. The 
master keeps its AS* signal asserted past the end of the dummy 
cycle to avoid driver conflicts with the next master. 

Either masters or slaves can terminate transfers after detecting 
an error. If a slave detects an error of any kind during 
a write, it terminates the transfer by asserting BERR* low. The 
master then terminates the transfer within two strobe edges. 
Figure 8 displays a write-error termination. 

If a master detects an error of any kind during a read, it 
terminates the burst by driving DS0*, DS1*, and AS* high. If 
the slave detects any errors, it also terminates the read by 
ceasing to toggle DTACK* and asserting BERR* low. In the 
latter case, the master aborts the transfer. Figure 9 illustrates 
read-error termination. 


Figure 8. Write with BERR* termination. 


VMEbus WAS ORIGINALLY CONCEIVED as a combination 
processor-memory-I/O bus. 3 Since its inception, processor 
speeds have increased by a factor of over 100, forcing architects 
to remove most CPU-memory traffic from the bus. Despite 
this, the demand for higher bus speeds continues to 
increase because of the need to support interprocessor communications 
at increasing rates and to support higher performance 
I/O for imaging and graphics, mass storage interfaces, 
and network communications. 

The required bus performance is not a simple function of 
processor speed. Rather, it is strongly dependent on system 
architecture and application. The decision to use a particular 
processor and a particular backplane bus for a particular 
application incorporates many components other than bus 
performance. Preeminent among these are cost/performance 
and risk management. By doubling performance with only 
a marginal increase in cost, SSBLT doubles the performance/cost 
ratio of the VMEbus, widening its market 
and extending its life quite significantly. By 
adding performance headroom to the dominant 
32-bit and now 64-bit backplane bus standard, 
SSBLT provides increased assurance that 
the VMEbus will continue to meet the needs 
of its users. 


Recently, Dante Del Corso, our editor-in-chief, 
sent me a note suggesting that 
there was a growing interest in PCMCIA, 
a 68-pin interface for memory cards used in note¬ 
book computers. (PCMCIA is the Personal Com¬ 
puter Memory Card International Association.) 
Dante felt that, although it isn’t currently cov¬ 
ered by any IEEE or ANSI organization, the de 
facto standard is having a major impact on the 
industry. 

Agreeing with Dante is I. Dal Allan, principal 
consultant at ENDL (Saratoga, California) and a 
recognized industry expert on interfaces. Allan 
concedes that the interface has grown from a 
convenient method of adding slim (3.3-mm thick) 
memory cards to notebook computers. 

Interest in the 68-pin interface seems to be 
growing simply due to the critical mass of ven¬ 
dors developing notebook and laptop comput¬ 
ers. But memory cards aren’t the only thing 
PCMCIA supports. Designers are preparing to 
use the interface for disk drives as well. Moreover, 
with emerging 1.8-in. Winchester disk drives, 
designers see a good market for add-ins for the 
portable computers. The emerging interface 
promises to provide superior interchangeability 
and better reliability over the long haul. It is rated 
at more than 10,000 insertions and ensures com¬ 
patibility over multiple vendors. 

PCMCIA defines three types of interfaces. Type 
I defines the interface for the 3.3-mm memory 
card. 

Type II allows the specification to accommodate 
storage devices that can't fit the 3.3-mm 
height constraint. Though the typical height is 5 
mm, Type II maintains compatibility with the base 
standard by using a 3-mm-wide rail along the 
edges and a 10-mm-deep mating area, both of 
which are kept at the standard 3.3 mm. The 
upshot is that designers won't have to rework 
slots or cases to manage the larger card. 

Still in the proposal stage is Type III, which is 
supposed to define the interconnection scheme 
for LANs (local area networks) and modem cards. 
This definition describes a 50-mm body exten¬ 
sion and an 11-mm height. This cavity size seems 
large enough for the 1.8-in. disk drive form fac¬ 
tors. The driving factor is the height since manu¬ 
facturers want to stay within the 0.5-in. thickness 
for notebooks. Palmtop computers, though, may 
pose a different set of problems. 

If you are looking for a quick solution and 
availability for PCMCIA extended products, don’t 
hold your breath. Members of the committee are 
still wrestling with sizing. For example, three 
Type I cards can’t fit into the Type III cavity, and 
that is some concern. Additionally, pin size and 
orientation haven’t been worked out. 

Among the other issues facing the PCMCIA 
specification writers is the number of insertions. 
Although the specification claims more than 
10,000 insertions, it is unclear what the real number 
is. It may be necessary to devise a new seating 
and release system to minimize wear and 
ensure proper electrical contacts. Furthermore, 
PCMCIA has problems with the disk storage ca¬ 
pacity for small drives. Consequently, some 
people talk about providing compression as part 
of the basic input/output system (BIOS). More 
than likely, PCMCIA vendors of storage devices 
will provide a fully integrated card including drive 
and BIOS with compression capability built in. 
No doubt Microsoft Corp. (Bellevue, Washing¬ 
ton) will want to get its two cents in with a ROM 
version of its DOS (disk operating system). 
Whether the Type III cavity can accommodate a 
full-featured card remains to be seen. 

Industry observers such as ENDL’s Allan and 
Richard Steincross from RMS Labs 
(Long Beach, California) wonder about 
the cost when compared to other al¬ 
ternatives. Allan points out that the 
X3T9 group has discussed porting SCSI 
(the small computer systems interface) 
to the PCMCIA world but to date has 
found no takers. The solution seems 
valid, and protocol chips exist that 
would support virtually any peripheral. 

Even with lack of support from the 
SCSI community for PCMCIA, Milpitas, 
California-based Adaptec Inc. is con¬ 
sidering a version of its 8000-series in¬ 
tegrated disk controller. “That may help 
on the cost angle,” suggests Steincross. 
He, however, expressed surprise that 
the emphasis isn’t on the IDE (inte¬ 
grated drive electronics) specification. 
Steincross explains that IDE caught on 
quickly because it “... was cheap, easy 
to integrate and heavily supported by 
the industry. I don’t see PCMCIA en¬ 
joying as much interest.” 

Though the jury may not be com¬ 
pletely in on PCMCIA, proponents 
point out the industry infrastructure is 
growing. Getting on the PCMCIA bandwagon 
isn't necessarily inexpensive, 
however. Executive and associate 
memberships carry fees of $10,000 and 
$2,500 a year. An executive member¬ 
ship buys you nine board seats, while 
an associate is allowed five seats. You 
can sign up as an affiliate, which al¬ 
lows you to attend meetings, observe, 
and receive documentation but not 
participate in discussions. If you are 
interested in membership or obtaining 
the latest document, call (408) 720- 
0107. 
Game Genie: copyrights and add-ons 


Users often modify computer programs to 
enhance their utility. Sometimes they buy 
add-on programs to increase perfor¬ 
mance. These add-ons provide interfaces that are 
easier to learn and remember, increase speed, 
add new functions, perform additional tasks, and 
otherwise interact with a preexisting program to 
provide additional utility. Typically, add-on pro¬ 
grams do not permanently modify the underly¬ 
ing program, do not make a tangible copy of a 
new, modified program, and cannot be used 
unless the customer has already purchased a copy 
of the underlying program.

Often, owners of copyrights in underlying pro¬ 
grams do not object to add-on programs. The 
add-ons add to the utility of, and increase the 
demand for, the underlying programs. There are 
many circumstances, however, in which copy¬ 
right owners might be displeased with—and 
therefore want to suppress—an add-on program. 

Consider the case of a low-power version of a 
program sold cheaply for the low-price market 
and a high-power version sold upmarket. What 
if an add-on cheaply converts the down-market 
model to perform the tasks of the upmarket 
model? (See the box on a similar case.) Some¬ 
times, an add-on program shows users the inad¬ 
equacies of the underlying program by providing 
improvements that the copyright owner has re¬ 
fused to be bothered to make. That may pave 
the way for customers to migrate to another prod¬ 
uct. This appears to have occurred in the case of 
database management add-on programs, which 
were eventually followed by competing programs 
that included the features of the add-on programs, 
as well as still additional improvements. 

I am not aware of any litigation over add-on 
programs of the types just described. However, 
a recent decision from the San Francisco federal 


trial court comes close. In Lewis Galoob Toys, 
Inc. v. Nintendo of America, Inc., 1 the court re¬ 
jected a claim that copyright owners have the 
exclusive right to determine whether end users 
may temporarily modify computer program-re¬ 
lated works once they are in the users’ hands. 
The underlying work in this case was a video 
game operated by a computer program. But the 
same legal conclusions would appear to apply 
with equal force to any other computer program- 
related work, such as a spreadsheet, database, 
or word processing program, in a consumer 
user’s hands. 

Background. Nintendo, a major Japanese and 
American seller of home and arcade video game 
equipment, owns copyrights in many popular 
video games. It markets these games in the home 
video field by selling cartridges that connect into 
its Nintendo Entertainment System (NES) game 
consoles. These are small special-purpose mi¬ 
crocomputers that provide a video display on a 
home television set. The cartridges fit into the 
consoles as cassettes fit into audio tape and vid¬ 
eotape players. 

A typical video game, such as Super Mario 
Brothers, features a protagonist (Mario) whom
the player moves across the display screen. The 
computer system displays obstacles and enemies 
which Mario must overcome. The player pushes 
buttons and manipulates controls to cause Mario 
to jump over obstacles, evade dangers, kill en¬ 
emies, and traverse a series of “worlds,” at the 
end of which he may rescue a princess from an 
ogre. Programmed-in constraints limit Mario’s 
abilities. He can jump only so high or so far. He 
has only so many missiles to hurl at enemies. 
His speed is limited. To the extent that the player’s 
abilities are insufficient, given the programmed- 
in constraints, to overcome the dangers that Mario 
faces, he succumbs and dies. After a
set number of deaths, the player loses 
and the game ends. 

Galoob markets an accessory, the 
Game Genie Video Game Enhancer. 
Game Genie fits between a video game 
cartridge and the NES console. It modi¬ 
fies electronic signals passing between 
the console and cartridge by tempo¬ 
rarily inserting code segments, or 
patches, into the computer program as
it appears in temporary memory (RAM). 
Game Genie thus allows a user to 
change the constraints, for example, to 
make Mario jump higher, run faster, 
and hurl more missiles at adversaries. 
It can also allow Mario more deaths 
before the player loses. Game Genie 
similarly modifies the play of other NES- 
compatible video games. 

Nintendo contended that Galoob 
was causing its customers to create 
derivative-work versions of the video 
game, in violation of Nintendo’s copy¬ 
right. Section 106(2) of the Copyright 
Act gives a copyright owner the exclu¬ 
sive right to prepare derivative works 
based on a copyrighted work. Section 
101 of the Copyright Act defines a de¬ 
rivative work as a work based on a 
preexisting work, such as translation, 
dramatization, motion picture version, 
art reproduction, condensation, “or any 
other form in which a work can be 
recast, transformed, or adapted.” 

Nintendo claimed that by modifying 
the program to change the rules of the 
game Galoob was making a change in 
Nintendo’s copyrighted work that 
amounted to preparation of a deriva¬ 
tive work. The copyrighted works in a 
video game, according to Copyright 
Office practice, include a computer 
program (literary work) and a visual 
display (audiovisual work). Since Game 
Genie changed the program by put¬ 
ting patches into the code, and since 
that changed the visual display, Galoob 
caused users to make unauthorized, 
derivative-work computer programs 
and displays. 

Nintendo also markets (or licenses 
others to market) devices that modify 


its video games in various ways, such 
as speeding up part of a game or skip¬ 
ping stages. But Nintendo maintained 
that its copyright gives it the exclusive 
right to traffic in such modifications. 
Since Galoob caused its customers to 
trespass on Nintendo’s claimed exclu¬ 
sive right to modify the game play, 
Nintendo accused Galoob of contributory
infringement, meaning contributing to,
or causing customers to engage in,
copyright infringement.

Galoob denied that the modifications 
created a derivative work. It also as¬ 
serted the affirmative defense that per¬ 
sonal game modification by end users 
for their personal enjoyment of games 
they had purchased was a fair use and 
was therefore privileged. The trial court 
agreed with Galoob on both counts, 
and an appeal is pending. 

Is there a derivative work? Game 
Genie does not make a physical copy 
of the computer code stored in a 
Nintendo cartridge. Its electronics and 
code patches merely interact with those 
of the NES console and the game car¬ 
tridge, modifying signals to change the 
video display and the results of game 
action. 

For example, a user might set Game 
Genie to continue the game until Mario 
dies six times, rather than three. In ef¬ 
fect, Game Genie substitutes its instruc¬ 
tions and data for parts of the original 
program. (For example, “do so-and-so 


for i = 1 to 3” becomes “do so-and-so 
for i = 1 to 6.”) 

But this change occurs only tempo¬ 
rarily, without rewriting the ROM in the 
Nintendo cartridge. Game Genie is like 
Maxwell’s Demon. It sits between the 
copyrighted computer program in the 
cartridge and the NES console’s CPU 
and censors the messages that go back 
and forth. Since it does not write any¬ 
thing down in a fixed form, it does not 
make a tangible, more-than-transitory 
copy of any of Nintendo’s computer 
program or visual display. The modi¬ 
fied code exists only in RAM, and the 
modified display appears only tempo¬ 
rarily on the TV screen. 
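
The interception described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Galoob's actual design: the class names, the ROM bytes, and the address of the "lives" constant are all hypothetical. The sketch shows substitute values being returned for selected reads while the stored program is left untouched.

```python
class Cartridge:
    """Read-only program store standing in for the game cartridge ROM."""

    def __init__(self, rom):
        self._rom = bytes(rom)  # immutable: the stored program is never rewritten

    def read(self, address):
        return self._rom[address]


class Interposer:
    """Sits between console and cartridge and overrides selected reads."""

    def __init__(self, cartridge, patches):
        self._cartridge = cartridge
        self._patches = dict(patches)  # address -> substitute value (hypothetical)

    def read(self, address):
        # Return the substitute value if this address is patched;
        # otherwise pass the cartridge's own byte through unchanged.
        if address in self._patches:
            return self._patches[address]
        return self._cartridge.read(address)


LIVES_ADDRESS = 0x02  # hypothetical location of the "number of lives" constant

rom = bytes([0xA9, 0x00, 0x03, 0x60])  # made-up bytes; 0x03 stands for three lives
cartridge = Cartridge(rom)
genie = Interposer(cartridge, {LIVES_ADDRESS: 6})

lives_seen_by_console = genie.read(LIVES_ADDRESS)    # the patched value
lives_stored_in_rom = cartridge.read(LIVES_ADDRESS)  # the original value
```

Removing the interposer restores the original behavior, since nothing in the cartridge has changed.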

However, section 106(2) of the Copy¬ 
right Act does not require that one 
make a permanent copy. It gives copy¬ 
right owners an exclusive right to pre¬ 
pare derivative works in tangible or 
intangible form. Thus a stage perfor¬ 
mance of the musical Cats without 
authorization from T.S. Eliot would in¬ 
fringe the copyright in his book of 
poems, Old Possum’s Book of Practical
Cats, even if no written script was
reproduced. Therefore, Game Genie’s 
program modifications apparently pre¬ 
pare a derivative work, in terms of both 
the program and the audiovisual work 
whose display the program causes. 

Moreover, one federal appellate 
court has already held very similar con¬ 
duct to be infringing preparation of a 


Third-party upgrade 


In Hubco Data Prods. Co. v. Man¬ 
agement Assistance, Inc., 219 USPQ 
450 (D. Idaho 1983), Hubco, the 
copyright owner, sold different ver¬ 
sions of its computer program de¬ 
signed to serve computer systems 
having different amounts of memory. 
Hubco charged a higher price as the 
amount of memory handled in¬ 
creased. Hubco also sold upgrade 
services. MAI, the infringer, engaged 


in a competing upgrade service (not 
the sale of an add-on program as 
such), by modifying the code of in¬ 
stalled Hubco programs to make 
them serve more memory. The court 
based its decision on the legal theory
that MAI infringed by reproducing 
copies of the copyrighted program 
in the course of decompilation and 
study undertaken to learn how to 
make appropriate upgrades. 
derivative work. In Midway Manufacturing
Co. v. Artic International, 2 the
court found that use of speedup kits to 
change the operation of arcade video 
games violated section 106(2). The 
speedup kit made the game harder to 
play, to be sure, while the Game Ge¬ 
nie makes the game easier, but that 
merely reflects different user purposes 
being served. 

In the speedup case, arcade propri¬ 
etors wanted to make the game harder 
because customers found it so easy that 
they either lost interest or lingered over 
it for an interminable time without in¬ 
serting additional coins. The speedup 
kit therefore improved the revenue that 
arcade owners could earn from video 
game equipment they had purchased 
to earn revenue. 

In the present case, some users find 
the game too hard to enjoy playing. 
They therefore want to improve their 
enjoyment from the game which they 
(or their parent) purchased to provide 
them with home entertainment. 
Whether the modification makes the 
game harder or easier does not mean¬ 
ingfully change the fact pattern of the 
Artic case from that of the Galoob case.
The key fact is that the alleged 
infringer’s conduct alters the code and 
display. 

No fixed copy. The Galoob court 
made a point of the lack of a tangible 
copy of the modified game program. It 
noted that the modified version would 
not be transferable to a third person 
because there was no copy. But that 
would be relevant only if some other 
section—not section 106(2)—were in¬ 
volved. Section 106(1) prohibits mak¬ 
ing unauthorized copies. Section 106(3) 
prohibits transferring unauthorized cop¬ 
ies to others. But this case did not al¬ 
lege a copyright infringement under 
those sections of the Copyright Act. 

The court sought to support its rul¬ 
ing by the statutory wording of the 
definition of derivative work, which 
includes the phrase “or any other form 
in which a work may be recast, trans¬ 
formed, or adapted.” It said that “form” 


implied fixed and tangible form. The 
court was not made aware of the legis¬ 
lative history of the definition, which 
is contrary to the court’s theory. 




The House Report accompanying the 
1976 Copyright Act points out that the 
omission in section 106(2) of any re¬ 
quirement of fixation in a tangible form 
was intentional. Section 106’s anti¬ 
reproduction and antidistribution 
clauses contain fixation-in-copy re¬ 
quirements, but that requirement was 
deliberately omitted from section 
106(2)’s provision against preparation 
of a derivative work—albeit for rather 
frivolous reasons. 

The report explained that the forms 
of some copyrightable works lend 
themselves to preparation of derivative 
works in impermanent or intangible 
form. Yet they deserved protection 
against unauthorized takings, which 
should therefore be defined as infringe¬ 
ments. Congress cited pantomime and 
ballet as examples illustrating the 
claimed need to eliminate the fixed- 
copy requirement for infringement by 
making derivative works. To save pan¬ 
tomime from piratical derivative works, 
therefore, Congress made preparation 
of derivative-work versions of panto¬ 
mimes—and all other works—a copy¬ 
right infringement, regardless of 
whether a tangible copy was made. 
Indeed, Congress did not even require 
as a condition of infringement liability 
that anything be done with the unau¬ 
thorized derivative work. 

Congress may have made an unwise 
or even foolish decision. It should have 


limited liability for preparing derivative 
works in intangible form to panto¬ 
mimes and similar works, so the statu¬ 
tory remedy would not sweep up 
conduct unrelated to its legislative con¬ 
cern. At least Congress should have re¬ 
quired some kind of use after 
preparation before liability attached. 
The statute is a mess. But that does 
not mean that the trial court should 
rewrite it to correct the legislative er¬ 
ror. That is not its job under our legal 
system. 

Users’ rights. In further support of 
its construction of the phrase “deriva¬ 
tive work,” the Galoob court pointed 
to the nature of the competing inter¬ 
ests at stake. It said the copyright law’s 
purpose is to balance “a fair return on 
an author’s creative labor against the 
need for ‘broad public availability of 
literature, music, and the arts.’ ” Galoob 
sells Game Genie to users who have 
already paid Nintendo its price for the 
video game cartridge. Users modify the 
games only for personal enjoyment, not 
for commercial gain. The conduct is 
analogous to skipping commercials on 
a videotape of a television program by 
fast-forwarding, or rewinding and view¬ 
ing in slow motion a critical play of a 
football game. None of this, the court 
said, deprives the copyright owner of 
the opportunity to derive “current or 
expected revenue.” 

(The facts are somewhat more com¬ 
plicated than the court paints them, 
although on balance its assertion may 
accurately characterize them. The court 
did not mention here that Nintendo 
sought to gain added revenue from its 
customers by marketing somewhat dif¬ 
ferent game modification devices to 
them. One could argue, therefore, that 
to whatever extent Galoob satisfies this 
market, Nintendo cannot instead sat¬ 
isfy it for its own profit and is there¬ 
fore deprived of expected revenue. In 
response, Galoob might make two 
points. First, to date Nintendo has ne¬ 
glected the needs of these particular 
customers and left them unsatisfied or 
else they would not be Game Genie 
customers. Second, why should
Nintendo have a monopoly over this 
ancillary market? That is something to 
be decided, not assumed.) 

The court summed up the equities 
of the case as follows: “Having paid 
Nintendo a fair return, the consumer 
may experiment with the product and 
create new variations of play, for per¬ 
sonal enjoyment, without creating a 
derivative work .... For these reasons, 
this Court finds that the Game Genie 
does not create a derivative work pro¬ 
tected by the copyright laws.” 

The result may be correct, but the 
legal analysis has gaps. The court’s ar¬ 
gument is sound for creation of a de¬ 
fense of estoppel, privilege, or implied 
or constructive license. But such an 
affirmative defense is quite distinct from 
the proper statutory construction of the 
phrase “derivative work.” 

Fair use. As an alternative holding, 
and assuming in the course of the ar¬ 
gument that Game Genie prepared a 
derivative work, the court went on to 
find a fair use. Fair use is a statutory 
privilege codifying a series of judicial 
decisions favoring certain uses of a 
copyrighted work as privileged or im¬ 
munized from liability. The privilege is 
the end user’s in the first instance. But 
it extends to a person charged with 
contributory infringement liability be¬ 
cause of responsibility for an end user’s 
conduct—for example, if the person 
sold the end user the equipment used 
to commit the alleged copyright in¬ 
fringement. There can be no contribu¬ 
tory infringement without direct 
infringement. Hence, if the end user’s 
alleged direct infringement is privileged 
as fair use, then the accused contribu¬ 
tory infringer cannot be punished in 
damages for contributing to a nonex¬ 
istent direct infringement. 

That the end user’s use is noncom¬ 
mercial creates a rebuttable presump¬ 
tion in favor of fair use. Here, the end 
user uses Game Genie for private home 
entertainment. This places the facts of 
the case on a par with those of Sony 
Corp. v. Universal City Studios, Inc. 3 In


that case, the Supreme Court found it 
fair use for users of home videotape 
recorders to record broadcasts for later 
viewing. 




Another factor in favor of a finding 
of Game Genie fair use, the court con¬ 
cluded, was that the end users had al¬ 
ready paid Nintendo for the video game 
cartridge. The court felt that purchase 
of the game cartridges gave customers 
a right to maximize their enjoyment of 
their purchase, including by modify¬ 
ing how the product worked. 

Finally, the most important factor in 
the fair-use analysis was that Nintendo 
could not show that Game Genie 
would adversely affect the market for 
the copyrighted work by diverting sales 
away from Nintendo. The court rejected 
Nintendo’s claim that Game Genie 
harmed the “Nintendo Culture,” a con¬ 
cept the company promoted as “the 
apex of Nintendo’s marketing strategy 
... a [customer] mind-set intentionally
created by Nintendo.” Part of the 
Nintendo Culture, according to its mar¬ 
keting experts, is peer rivalry among 
video game players, who gain prestige 
by achieving high scores in the game. 
Players verify their achievement by 


photographing a screen displaying the 
high score. If everyone could get high 
scores with Game Genie, Nintendo 
said, then “this socially reinforcing prac¬ 
tice would fall by the wayside,” and 
Nintendo would lose future sales. 

The court refused to believe this 
theory because Nintendo was itself 
marketing products and a magazine 
that helped players modify game play 
in a manner similar to Game Genie. In 
any event, the court would not award 
Nintendo in litigation what Nintendo 
could not achieve by its competitive 
efforts in the marketplace, namely “the
exclusive right to modify game play as it
alone sees fit,” because the Copyright
Act does not bestow that power on 
copyright owners. 

Implications for add-on pro¬ 
grams. Much of what the court said 
carries over from patching video game 
computer programs with a Game Ge¬ 
nie to patching an application with an 
add-on program. First, the modified 
code exists temporarily in RAM and is 
not written into ROM. That, of course, 
disregards the peculiar history of the 
part of the Copyright Act dealing with 
copyright infringement by preparation 
of derivative works. 

More important, the modification 
occurs for the benefit of an end user 
who has already purchased a copy of 
the underlying copyrighted computer 
program. By the same token, the copy¬ 
right owner has already been paid once 
for the right to use the program. The 
end use may be for commercial or 
noncommercial purposes, depending 
on the user. No one sells or transfers 
the modified program to others. Finally, 
one can assert that use of the add-on 
program will not divert sales away from 
the copyright owner. 

These factors, on balance, suggest 
that the verdict in an add-on copyright 
infringement suit should be against the 
copyright owner and in favor of end- 
user rights. But the chain of argument 
summarized above has defects which 
might lead a different court to reach a 
different result. On the other hand, the 


April 1992 77 









Micro Law 


Galoob court left unmentioned other 
arguments favoring add-ons and end 
users’ rights which, if properly pre¬ 
sented, might lead another court to the 
same result by another route. 

Alternative analyses. The Galoob 
court’s opinion is a dog’s breakfast. 
First, a derivative work was probably 
prepared, even though it was not a 
fixed, tangible copy. The existence of
such a work indicates the need to consider
possible affirmative defenses (fair
use is only one) that may negative the 
case of copyright infringement. Even 
when a copyright is infringed, some¬ 
times circumstances excuse the in¬ 
fringement. 

There may well have been a fair use, 
but the court’s legal analysis of fair use 
is flawed by its overstatement of the 
case. Such overstatement is inherent in 
fair-use analysis. The legal standard for 
determining whether a use is fair calls 
for a balance of four incommensurable 
factors, in the form, fair use = apples + 
oranges + lemons + grapes. The only 
way the court felt it could balance the 
factors against one another was to say 
that none of them favored the losing 
party. Otherwise, the court would have 
been compelled to decide whether the 
apples carried more legal weight than 
the oranges—a daunting task. 

(By way of analogy, to find a minimum
or maximum of F(x,y,z,t), you must solve
for the partial derivatives of F with
respect to x, y, z, and t each
successively being set to zero. Otherwise,
you cannot tell that you have a
peak rather than merely a point along
a ridge or a saddle point.)
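
The analogy can be written out as follows. This is ordinary multivariable calculus, offered as an illustration rather than anything the court's opinion states: a stationary point of F requires all four partial derivatives to vanish together.

```latex
\[
\frac{\partial F}{\partial x}
= \frac{\partial F}{\partial y}
= \frac{\partial F}{\partial z}
= \frac{\partial F}{\partial t}
= 0 .
\]
```

If only some of the four conditions hold, the point may simply lie along a ridge. Even when all four hold, telling a peak from a saddle point takes a further check of the second derivatives.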

The court defined away the prob¬ 
lem by assuming that Game Genie sales 
were not diverting sales of Nintendo’s 
copyrighted product. That tipped the 
most important of the fair-use factors 
wholly in Galoob’s favor. But Nintendo 
was, by the court’s own account, mer¬ 
chandising a set of ways to prepare 
derivative-work versions of the copy¬ 
righted video games, while Game Ge¬ 
nie provided another. Arguably, this 
resulted in competing versions of the 
copyrighted work in the same market¬
place. That fact, if it is one, casts doubt 
on the correctness of the court’s con¬ 
clusion that the alleged copyright in¬ 
fringement did not supplant Nintendo 
to any degree as a seller of the copy¬ 
righted work. 




The same kind of thing can occur 
with an add-on program. Consider 
4DOS, a shareware add-on program 
that interacts with MS-DOS. 4DOS, 
which has been available for several 
years, offers its users many functions 
that Microsoft did not include in MS- 
DOS 2 and 3—for example, on-line 
help with DOS command meanings 
and ability to recall and edit prior com¬ 
mand entries (by using arrow keys). I 
am in no hurry to upgrade to MS-DOS 
5, since 4DOS already provides most 
of what I would get from it (and most 
or all of the rest is found in old,
already-installed programs such as 1Dir+
and PC Tools). An add-on program can 
thus dry up the market for later ver¬ 
sions of the underlying program. 

That does not necessarily tip the fair- 
use analysis in Nintendo’s favor or 
against add-on programs in general. But 
the need to attempt an apples versus 
oranges fair-use analysis, instead of de¬ 
fining it away, makes the fair-use ap¬ 
proach much more precarious. An 
alternative legal analysis may therefore 
be preferable, if it supports the same 
result. 

The court’s repeated emphasis on 
user rights points the way toward sev¬ 
eral possible such analyses. One is the 
legal doctrine of implied license. In 


patent law, a purchaser of a patented 
product has an implied license to 
modify it to increase its value. For ex¬ 
ample, the purchaser of a canning ma¬ 
chine may modify it to process a 
different size can. The implication of 
the license is apparently by action of 
law, not from the surrounding facts. 
That is, the court implies the license 
based on its sense of fitness, not on 
the basis that the parties actually in¬ 
tended to agree to a license. That sug¬ 
gests that an attempted disclaimer of 
the implied license by the patent owner 
would be ineffective. The argument 
from patent law has not yet been car¬ 
ried over to copyright law. It may fail, 
but it is worth considering. Implied li¬ 
cense would seem to be at least as 
strong an argument as fair use. 

A related alternative legal argument is 
estoppel: The seller is estopped from 
preventing the customer from fully en¬ 
joying the use of purchase. There ap¬ 
pears to be no substantial difference 
between estoppel and implied license. 
Estoppel is just another name for the idea 
that, by selling the product to the cus¬ 
tomer, the seller has acted in a manner 
inconsistent with preventing the customer 
from fully benefiting from the sale. 

The doctrine against derogation from 
grants provides yet another legal argu¬ 
ment amounting to the same thing. In 
British Leyland Motor Corp. v. 
Armstrong Patents Co., 4 the House of
Lords held that a car manufacturer, who 
owned copyrights covering tail pipes 
and other spare parts, could not re¬ 
quire car owners to procure spare parts 
only from its licensees. The court con¬ 
sidered that to use copyright law for 
this purpose would derogate from the 
title to the car that the manufacturer 
had conveyed to the customers upon 
sale. Any added expense or inconve¬ 
nience imposed on car purchasers that 
interfered with their enjoyment of the 
purchased goods would derogate from 
the grant of title. Therefore, the court 
would not permit the seller to assert its 
copyrights to cause such results. 

Finally, probably the strongest alternative
legal argument, one wholly
unmentioned by the Galoob court, 
could be based on section 117 of the 
Copyright Act. Section 117 gives own¬ 
ers of copies of computer programs a 
right to make adaptations of the pro¬ 
grams when they do so as “an essen¬ 
tial step” in their utilization of the 
computer programs. The NES console 
is a computer, within any reasonable 
definition of that term. It is a low-end, 
special-purpose microcomputer having 
a microprocessor chip as central pro¬ 
cessing unit. The video game cartridge 
contains a stored computer program, 
among other things. At least prima fa¬ 
cie, the fact situation is that defined by 
section 117. 

The only problem areas are 

• Does the fact that modifying the 
computer program also modifies 
an audiovisual work take the case 
outside section 117? 

• Is the adaptation “an essential 
step” in utilization of the computer 
program? 

On the first point, a court would 
probably regard the program and au¬ 
diovisual display as a unitary work. 
That is how the Copyright Office reg¬ 
isters them. Therefore, the right to 
modify the program should carry with 
it the right to let the modifications 
change the audiovisual display since 
the two are a single legal unit. 

The second point is less predictable. 
Legal authority is divided, but the bet¬ 
ter view is that “an essential step” is 
one intrinsic to the contemplated use 
of the program or one that the end user 
strongly desires to accomplish. 5 

These alternative rationales for the 
Galoob court’s result, singly or in com¬ 
bination, would probably have given 
stronger support to the judgment than 
the fair-use analysis did. They are not 
different in kind, however, from the 
court’s rationale of focus on users’ 
rights. Where the proposed alternatives 
differ from the court’s approach is that 
they seek to provide a legal theory cen¬ 


trally based on users’ rights rather than 
to invoke such rights as incidental sup¬ 
port for another legal theory. 

One may quarrel with the chain of 
legal reasoning in the Galoob decision, 
but Galoob definitely tells us something 
important. It suggests that courts favor 
the rights of the customer in an add¬ 
on situation, at least when copyright 
owners cannot show that the customer 
is taking a free ride at their expense. 
As Justice Black once said about re¬ 
pair and reconstruction of patented 
products, “One royalty to one paten¬ 
tee for one sale is enough.” 6 

Courts are likely, therefore, to feel 
that customers deserve considerable 
freedom to modify a purchased com¬ 
puter program. The sentiments about 
relative equities expressed in the 
Galoob decision may thus be more 
precedential, for prediction purposes, 
than the legal analysis. 
I installed Disk Express II and
watched it go through its paces. I don’t 
think my hard disk was badly frag¬ 
mented, so I’m not sure how much 
improvement I’ve seen, but I am sure 
of two things. Disk Express II is easy 
to use, and I feel a lot better knowing 
that its many features will be available 
if odd situations arise: 

Multidisk is a hard-disk partitioning 
program. It allows you to place files 
into logical groups kept together on 
your disk to minimize access times. A 
number of protection features provide 
advantages on networked systems or 
on systems that more than one person 
can access. 

Alsoft also claims that partitioning 
your disk provides another layer of 
protection against computer viruses. I 
have not had virus problems on my 
Mac, but my PC was recently infested 
by the notorious Michelangelo virus. 
That’s a story for another column. 

Disk Check diagnoses disk and di¬ 
rectory problems. It couldn’t find any 
problems with my disk or directories, 
but it did help me identify and remove 
an invisible anchored file belonging to 
a program I got rid of long ago. 

If you use your Macintosh regularly, 
Disk Express II will certainly improve 
your efficiency. I think it’s worth the 
modest price. 

I don’t have space to describe all of 
the other interesting programs I’ve re¬ 
ceived recently. Some of these will 
appear in future columns. 
Regression testability


[This issue, Lee White and Hareton Leung make 
an argument for a more systematic approach to 
regression testing in software development. 

I invite readers to send information on a tool 
or method that solves problems, for consideration 
in future columns. - C.W.]

Lee White 

Case Western Reserve University 

Hareton K.N. Leung 
Bell-Northern Research 

Designers consider many criteria when
writing software. In these times when 
change is so common, one of the most 
important is maintainability, which strongly cor¬ 
relates with other desirable criteria. We propose 
a more systematic approach to an important as¬ 
pect of maintainability: regression testing. 

We can differentiate between perfective main¬ 
tenance, which enhances software functionality, 
and corrective maintenance, which detects and 
corrects defects. Whether the maintenance is 
perfective or corrective, we must ensure that it 
does not inadvertently affect unmodified 
functionalities. When this occurs, we call it a 
regression error. Regression testing uses test data 
previously developed for these unmodified 
functionalities to detect such regression errors. 

Regression testing is important at the unit, inte¬ 
gration, and system (or functional) testing levels. 
However, software development teams usually 
have responsibility for unit and integration testing. 
These teams do not consistently apply regression 
testing at these levels when they make changes 
and often do not even systematically retain test 
data. System or functional testers, on the other hand, 
are very systematic about keeping test data and 


applying regression testing. Catching regression errors only at
this late stage costs more than detecting them earlier.

We recently completed a research project, 1,2
where for small changes we endeavored to iden¬ 
tify subsets of test data to use as regression tests 
at the unit, integration, and system levels. Our 
idea was to cut the cost and time to run the 
regression tests for numerous small changes in 
the software and to focus on the areas of tests 
related to code or functionalities of the change. 

Figure 1 illustrates the approach for regres¬ 
sion testing at the unit level so a subset of the 
test data can detect a unit regression error. The 
static analyzer detects those test data, called 
reusable tests, that cannot be affected by the 
program changes. We could accomplish this 
detection with static slices, as proposed by 
Weiser, 3 which identify statements that could 
be affected by program changes. If this fails to 
lead to a sufficient reduction in test data, we 
could use dynamic slicing, as introduced in 
Korel and Laski, 4 which indicates statements 
actually affected by program changes. 

The dynamic analyzer executes the remaining
tests and identifies obsolete tests that no
longer achieve their intended function since
the program has changed. The remaining retestable
tests reveal regression errors. We require
new tests to update functional tests if the specification
of the module has changed, or to obtain
structural tests to achieve a specified level
of coverage. To do so, we can use the module
dynamic tester shown in Figure 1.
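
The unit-level classification described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the analyzer of Figure 1: the UnitTest record, its field names, and the sample data are hypothetical, and a plain set intersection stands in for static or dynamic slicing. Tests that must be rerun are called retestable here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitTest:
    """Hypothetical test record; the field names are illustrative assumptions."""
    name: str
    covered: frozenset      # statement ids the test executed on the old version
    spec_items: frozenset   # specification items the test checks


def classify(tests, affected_statements, changed_spec_items):
    """Split tests into reusable, obsolete, and retestable groups."""
    groups = {"reusable": [], "obsolete": [], "retestable": []}
    for test in tests:
        if not (test.covered & affected_statements):
            # Cannot be affected by the change: no need to rerun.
            groups["reusable"].append(test.name)
        elif test.spec_items & changed_spec_items:
            # Checks behavior whose specification changed: needs replacing.
            groups["obsolete"].append(test.name)
        else:
            # Touches changed code but is still valid: rerun it.
            groups["retestable"].append(test.name)
    return groups


tests = [
    UnitTest("t1", frozenset({1, 2, 3}), frozenset({"add"})),
    UnitTest("t2", frozenset({4, 5}), frozenset({"delete"})),
    UnitTest("t3", frozenset({5, 6}), frozenset({"list"})),
]
result = classify(tests, affected_statements={5}, changed_spec_items={"delete"})
print(result)
```

Only the retestable group has to be executed to look for regression errors; the obsolete group signals where new tests are needed.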

The method we propose may result in a higher 
payoff situation for regression testing at the inte¬ 
gration level. We endeavored to develop a gen¬ 
eral approach that does not depend on the 
particular type of integration (such as top-down, 
bottom-up, or hybrid), as long as it is incremental. 


Firewall. A key question is, given 
the modified modules, how many 
modules must we retest? In other 
words, where can we draw a firewall 
around those modules which must be 
retested? When we detect errors, we 
should make changes so as to keep 
the firewall from spreading to include 
more system modules if possible. 

To make this analysis precise, we as¬ 
sume that all module dependencies are 
indicated in the call graph. Figure 2 shows 
an example call graph with four modified
modules C1-C4. This is a severe assumption
because global variables are
another common dependency in many 
programs, in which a number of mod¬ 
ules may define global variables used by 
otherwise unrelated modules. I will briefly 
discuss global variable testing later. 

Resource dependencies may also exist
between modules. An example is
memory, a constrained resource shared
among a number of modules. To obtain a precise
result, we assume that the only errors present are due
to the modified modules from the maintenance
effort. We also assume reliability
of the unit and integration tests. (I
will return to this assumption at the
end of this analysis.)

Given these assumptions, we analyzed 
a number of basis cases describing mod¬ 
ule dependencies in the call graph. 2 The 
possibilities for a module a are 

• no change in a, NoCh(a),

• only code change in a, CodeCh(a),
and

• change in the specification of a,
SpecCh(a).

If module a calls module b, then we 
can ignore any cases in which neither 
a nor b is changed. This leaves us with
eight cases to consider. We must also 
model the addition of new modules 
and deletion of modules for which we 
have identified several more cases. 
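
The count of eight cases follows from simple enumeration, as the sketch below shows. The state names mirror the NoCh, CodeCh, and SpecCh labels; the function names are illustrative assumptions, and the article's own case numbering is not reproduced. The second helper picks out the pairs in which exactly one of the two modules is unchanged.

```python
from itertools import product

STATES = ("NoCh", "CodeCh", "SpecCh")


def basis_cases():
    """All (caller, callee) change states except the one where neither changed."""
    return [(a, b) for a, b in product(STATES, repeat=2)
            if (a, b) != ("NoCh", "NoCh")]


def boundary_cases():
    """Cases in which exactly one of the two modules is unchanged."""
    return [(a, b) for a, b in basis_cases()
            if (a == "NoCh") != (b == "NoCh")]


print(len(basis_cases()), len(boundary_cases()))
```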

Figure 2. Basis cases and the calculation of a firewall, indicated by bold arrows.

Figure 3. Boundary cases for firewall construction. (Panel labels recovered: Basis case 2, Basis case 4.)

Figure 3 indicates
the four critical boundary cases. Two
cases correspond to an unchanged
module calling a modified module, and
the other two correspond to a modified
module calling an unmodified
module. Case 6 in Figure 3 is unusual
in that one would not expect an unchanged
module to call a module in
which the specification has changed.
This unusual situation creates two consequences.

• We should reexecute not only the 
integration tests between modules 
but also the unit test of the calling 
module. 

• Case 6 is the only case in which 
we cannot guarantee that only the 
modified module needs to be 
changed if any tests detect an er¬ 
ror. If the calling module is modi¬ 
fied, the firewall expands. 

The other three cases are simpler and 
can be characterized by the following 
observations: 

• We need to reexecute only the 
integration tests between the two 
modules in each case. We do not 
have to rerun the unit tests for the 
unchanged module. 

• If any of the tests detects an error, 
we can correct the error by chang¬ 
ing only the modified module. The 
programmer should not change 
the unmodified module. If we fol¬ 


low this discipline, the firewall 
does not expand. 

To illustrate the firewall concept,
return to Figure 2. The four modules
C1-C4 are given as modified. We must
rerun integration tests U2-C1, U2-C3, C2-U4,
C1-U7, C3-U8, C4-U5, and C4-U6, but
no unit tests for unchanged modules
need be rerun. Figure 2 shows the
firewall as bold arrows that separate
the affected modules from the rest of
the call graph, just as the firewall does
in real testing.
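
A firewall of this kind can be computed mechanically from a call graph. The sketch below is a minimal illustration, assuming the call graph is a dictionary from caller to callees; the graph shown is hypothetical and is not a transcription of Figure 2. The boundary consists of the call edges with exactly one modified endpoint, which are the integration tests to rerun.

```python
def firewall_edges(calls, modified):
    """Call edges with exactly one modified endpoint (the firewall boundary)."""
    return sorted(
        (caller, callee)
        for caller, callees in calls.items()
        for callee in callees
        if (caller in modified) != (callee in modified)
    )


# Hypothetical call graph: U* modules are unchanged, C* modules are modified.
calls = {
    "U1": ["U2", "U3"],
    "U2": ["C1", "C3"],
    "U3": ["U9"],
    "C1": ["C2", "U7"],
    "C2": ["U4"],
    "C3": ["C4", "U8"],
    "C4": ["U5", "U6"],
}
modified = {"C1", "C2", "C3", "C4"}

edges = firewall_edges(calls, modified)
for caller, callee in edges:
    print(f"rerun integration test {caller}-{callee}")
```

Edges such as U1-U2 or U3-U9 lie outside the firewall and need no retesting under the stated assumptions.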

I must make a final remark about 
our assumption that no errors exist in 
the system other than those due to the 
modified modules and that all unit and 
integration tests are reliable. Of course 
these assumptions do not hold true in 
practice, and thus our precise conclu¬ 
sions are no longer valid. However, we 
can make these conclusions practical 
by observing that testing the modules 
within the firewall is a sensible use of 
testing resources, even if we cannot
rule out errors existing elsewhere.

Computational experience. We 

also conducted an experiment to evalu¬ 
ate these regression testing concepts. 2 
We used a student database program 
with 20 distinct modules, 32 modules, 
seven software features (major software 
functions in the specifications), and over 
550 executable lines of Pascal code. The 
author provided four real modifications 
that we could evaluate with regression 
testing using reduced tests. 

1992 Gordon Bell Prize

For Outstanding Achievements in the Application of Parallel
Processing to Scientific and Engineering Problems

Entries are due May 1, 1992, with
finalists to be announced by June
30 and winners announced at the
Supercomputing 92 conference in
November 1992. Prizes of $1,000
each will be awarded in two of
three categories:

• Performance, based on megaflop
rate on a machine with known
performance compared against
similar applications. If this is not
possible, entrants should document
their performance claims.

• Price/performance, based on
performance divided by the cost
of the smallest practical computational
engine, including critical
peripherals. Performance measurements
will be evaluated as
for the performance category.

• Compiler parallelization, based
on the most speedup, measured
by dividing the wall-clock time of
a good serial implementation of the
same job by that of the parallel run.

General conditions include demonstrating
the utility of the program
and machine. The judges will
also consider how much the entry
advances the state of the art in some
field.

For more information or to enter,
contact:

1992 Gordon Bell Prize
c/o Marilyn Potes
IEEE Computer Society
10662 Los Vaqueros Circle
PO Box 3014
Los Alamitos, CA 90720-1264
Phone: (714) 821-8380

Table 1 shows the results of this
study. Modifications 1, 2, and 4 show
considerable reductions in the number
of required tests, but modification 3
shows little reduction. The top four
lines in Table 1 show the reason for
this. Modification 3 has a slightly higher
number of affected modules or module
interactions than the other modifications.
The biggest difference is in the
number of affected features. Since
modification 3 affects all features, the
number of system tests does not decrease.
Note that the design was good
enough to anticipate modifications 1,
2, and 4 but was not maintainable for
modification 3.

Table 1. Regression testing evaluation.

Modification                       1     2     3     4
Affected source lines             25    80    23    57
Affected modules                   2     4     8     8
Affected module interactions       2     3    16    10
Affected features                  1     1     7     1
Regression tests
  Unit                            15    22    50    40
  Integration                     32    32   120    80
  System                          46    24   130    38
Total tests
  Unit                            27    40    67    66
  Integration                    246   278   275   307
  System                         106   130   130   158
Total tests                      379   448   472   531
Regression tests                  93    78   300   158
Percentage                       24%   17%   63%   30%
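
The summary rows of Table 1 can be rechecked from the per-level counts. The sketch below copies the unit, integration, and system figures from the table and recomputes the totals and the share of tests rerun for each modification; the variable and function names are illustrative. The recomputed percentages agree with the printed ones to within one point.

```python
# Per-level counts copied from Table 1 (modifications 1 through 4).
regression = {
    "unit": [15, 22, 50, 40],
    "integration": [32, 32, 120, 80],
    "system": [46, 24, 130, 38],
}
total = {
    "unit": [27, 40, 67, 66],
    "integration": [246, 278, 275, 307],
    "system": [106, 130, 130, 158],
}


def column_sums(levels):
    """Sum the unit, integration, and system counts for each modification."""
    return [sum(column) for column in zip(*levels.values())]


regression_totals = column_sums(regression)
full_totals = column_sums(total)
percentages = [100 * r / t for r, t in zip(regression_totals, full_totals)]

for number, (r, t, p) in enumerate(
        zip(regression_totals, full_totals, percentages), start=1):
    print(f"modification {number}: {r} of {t} tests rerun ({p:.1f}%)")
```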

To establish the effectiveness of the
test subsets and find which tests detected
errors, we asked the programmer
to introduce 13 logic errors. Of
these, the tests detected 12, with the
subsets as effective at detecting these
errors as the full test sets. Of the 12
errors detected,

• unit testing detected eight errors, 

• integration testing detected 11 er¬ 
rors, and 

• system testing detected 12 errors. 

Some lessons here mirror current 
practice. We could avoid bothering 
with unit or integration testing and 
conduct only system testing, but this 
would be very expensive and time- 
consuming. The values of both unit test¬ 
ing and integration testing are clear. 
Integration testing detects different er¬ 
rors than does unit testing. Developers 
should not only perform both types of 
testing but also save the data to do re¬ 
gression testing at these two levels. 

Global variables. We also studied 
regression testing global variables’ and 
found that the global variables may be 
treated as parameters passed between 
modules for the purpose of regression 
testing. This is despite the fact that glo¬ 
bal variables are insidious in that many 
modules may become dependent 
through extensive use of global variables.
A change in one or a few modules
may cause changes in modules
throughout the entire system.
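
One way to picture global variables as parameters passed between modules is to turn each definition and use into an explicit dependency edge. The sketch below is an assumption-laden illustration, not the method of the cited study: the defines and uses maps, the module names, and the variable name are all hypothetical.

```python
def global_variable_edges(defines, uses):
    """Return (writer, reader, variable) edges implied by shared globals."""
    edges = set()
    for variable, writers in defines.items():
        for writer in writers:
            for reader in uses.get(variable, ()):
                if reader != writer:
                    # Treat the global as a parameter passed writer -> reader.
                    edges.add((writer, reader, variable))
    return sorted(edges)


# Hypothetical modules sharing one global variable.
defines = {"record_count": ["load_file", "add_student"]}
uses = {"record_count": ["print_report", "add_student"]}

edges = global_variable_edges(defines, uses)
for writer, reader, variable in edges:
    print(f"{writer} -> {reader} via {variable}")
```

Edges produced this way could be added to the call graph before the firewall is drawn, so that otherwise unrelated modules linked by a global are retested together.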

We are completing research on how 
developers should test global variables. 
The study is complex, but should pro¬ 
vide an approach for developers to 
actually test the effects of global vari¬ 
ables they insist on using. 
Second-generation RISC chips

Ware Myers, Contributing Editor 

Authors presented papers describing 10 sec¬ 
ond-generation RISC processors at Compcon 
Spring 92, in February. Table 1 lists the papers, 
which are contained in the Digest of Papers, 
available from the Computer Society. 

Most of these papers address new or unusual
design features, so they do not contain complete
application data. The most advanced chips have
line widths of 0.75 to 0.80 µm and contain more
than one million transistors. Most of them achieve
high performance by using both superscalar and
superpipeline techniques. With a million-plus
transistors, the chip can, of course, contain many
performance-enhancing features.

DEC Alpha. The Alpha is not merely a new 
chip. It is a new architecture that DEC intends "to 
withstand the test of time” for the next 25 years. 
EV4 is the first implementation of this architec¬ 
ture. DEC envisions more powerful implementa¬ 
tions in coming years. “Future generations will be 
able to deliver up to a 1,000-fold increase in per¬ 
formance,” the company said in its announcement. 

The first implementation of Alpha is in 0.75-µm,
3.3V, CMOS technology. It contains 1.68
million transistors on a 1.68 x 1.39-cm chip with 
431 pins and operates at 200 MHz. The 64-bit 
CPU can issue two instructions each clock cycle 
to two of the four pipelined functional units: in¬ 
teger, floating point, branch, and load/store. Thus, 
the peak issue rate can reach 400 MIPS. 

Alpha’s architecture accommodates Digital’s
Open Advantage by supporting both OSF/1 and
OpenVMS operating systems. Thus, it provides
an upgrade path from DEC’s existing VAX architecture
and an opportunity for any organization
using Unix. DEC plans to license it to anyone.

Alpha can be employed as a single CPU in 


personal workstations, or in large aggregations 
it can form massively parallel systems. Cray Re¬ 
search plans to use it in this way. 

HP reduces path length. “The path length 
of a computation is the number of instructions 
needed to process the computation,” Ruby Lee 
of Hewlett-Packard’s Cupertino, California, fa¬ 
cility, pointed out in the session devoted to HP’s 
PA (precision architecture) RISC architecture. 
“The execution time of a computation is the prod¬ 
uct of the path length in instructions executed, 
the average cycles taken per instruction, and the 
cycle time of the processor.” 

It follows then that we can improve the per¬ 
formance of a processor if we can reduce one 
or more of these variables. Cycle time depends 
largely on the underlying technology. In the case 
of HP’s current implementation, clock time is 
down to less than 10 ns. Superscalar and 
superpipelining reduce the cycles per instruc¬ 
tion. RISC architecture originally speeded up 
processing by limiting the instruction set to rela¬ 
tively fast-executing instructions, permitting clock 
time to be reduced. 
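
Lee's relation is easy to put into numbers. The sketch below multiplies path length, cycles per instruction, and cycle time; the workload figures are invented for illustration and are not HP measurements. The second helper applies the same kind of arithmetic to a peak issue rate, using the two-instructions-per-cycle, 200-MHz figures quoted for Alpha earlier in this article.

```python
import math


def execution_time(path_length, cycles_per_instruction, cycle_time_s):
    """Execution time = path length x cycles per instruction x cycle time."""
    return path_length * cycles_per_instruction * cycle_time_s


def peak_issue_rate_mips(instructions_per_cycle, clock_mhz):
    """Peak issue rate in millions of instructions per second."""
    return instructions_per_cycle * clock_mhz


# Hypothetical workload: 1,000,000 instructions, 1.5 cycles each, 10-ns cycle.
baseline = execution_time(1_000_000, 1.5, 10e-9)
# Same workload with a 20 percent shorter path length.
shorter_path = execution_time(800_000, 1.5, 10e-9)

print(f"baseline: {baseline * 1e3:.1f} ms, shorter path: {shorter_path * 1e3:.1f} ms")
print(f"peak issue rate: {peak_issue_rate_mips(2, 200)} MIPS")
assert math.isclose(shorter_path / baseline, 0.8)
```

Because the three factors multiply, a 20 percent cut in path length gives a 20 percent cut in execution time when the other two factors stay fixed.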

According to Lee, “The PA-RISC approach is 
the first to specify a datapath to meet minimum 
requirements, then to find opportunities to ex¬ 
ercise multiple independent functional units with 
a single instruction so as to minimize path lengths 
and improve cost performance.” 

This opportunity is found in “multi-op” instruc¬ 
tions that combine two or three operations in 
the 32-bit instruction after the fashion of VLIW 
(very long instruction word) architecture. A single 
instruction, thus, can initiate several operations 
on separate hardware resources in parallel. In 
effect, this technique manages to do more work 
in a single instruction, shortening the path length 
of instructions implemented in this way. 

Table 1. Papers on recent RISC microprocessors, as contained in the
Digest of Papers, Compcon Spring 92.

Processor                        Company                             Lead author
PA-RISC 1.1                      Hewlett-Packard                     Eric DeLano
Supersparc                       Sun Microsystems/Texas Instruments  Greg Blanck
Pinnacle-1 (Sparc-compatible)    Cypress/Ross Technologies           Raju Vegesna
88110*                           Motorola                            Keith Diefendorff
Alpha (EV4)**                    Digital Equipment                   Richard Sites
NVAX**                           Digital Equipment                   Mike Uhler
PowerPC**                        Apple/IBM/Motorola                  Ron Hochsprung
i960 Cx                          Intel                               Elliot Garbus
Am29030/35                       Advanced Micro Devices              Scott McMahon
LR33020 X-terminal controller    LSI Logic                           Robert Tobias

*See page 40 in this issue.
** Extension of IBM RS6000 architecture

First, the design team measured the frequency
of occurrence of pairs or triples of operations.
Then it selected several dozen
to implement. “None of these multi-op
instructions are difficult to generate
from a compiler’s point of view, since
they map naturally to generic programming
constructs,” Lee noted. “The additional
hardware required for these
multi-op instructions is minimal compared
to the additional performance
provided.”
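
The measurement step can be illustrated with a toy trace. The sketch below counts adjacent operation pairs and then replaces the most frequent pair with a single combined operation to show the effect on path length. It is a simplified illustration under stated assumptions, not HP's methodology: the trace, the operation names, and the fused name are hypothetical, and data dependencies and encoding limits are ignored.

```python
from collections import Counter


def pair_frequencies(trace):
    """Count how often each pair of adjacent operations occurs."""
    return Counter(zip(trace, trace[1:]))


def fuse(trace, pair, fused_name):
    """Replace each non-overlapping occurrence of `pair` with one operation."""
    fused, i = [], 0
    while i < len(trace):
        if i + 1 < len(trace) and (trace[i], trace[i + 1]) == pair:
            fused.append(fused_name)
            i += 2
        else:
            fused.append(trace[i])
            i += 1
    return fused


# Hypothetical dynamic trace of single-op instructions.
trace = ["shift", "add", "load", "shift", "add",
         "store", "compare", "branch", "shift", "add"]

best_pair, count = pair_frequencies(trace).most_common(1)[0]
shorter = fuse(trace, best_pair, "shift_and_add")
print(best_pair, count, len(trace), len(shorter))
```

Each fused occurrence removes one instruction from the path, which is the effect the multi-op instructions aim for.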

Typically one of these multi-op PA- 
RISC instructions replaces two or three 
single-op instructions. In a few in¬ 
stances the replacement rate was in the 
range of five to 10 single-op instruc¬ 
tions. “Overall, systems based on the 
PA-RISC architecture have achieved 
extremely competitive performance on 
both technical and commercial bench¬ 
marks,” Lee concluded. 

Metrology report 

Measurement technology is the key 
to boosting semiconductor productiv¬ 
ity, according to a US Department of 
Commerce report. Metrology for the 
Semiconductor Industry suggests that 
advances in metrology lead to break¬ 
throughs in semiconductor technology. 


Manufacturers who are better able to 
detect defective chips—and to prevent 
defects from occurring—develop more 
efficient processes. Furthermore, as 
designers build smaller chips and 
incorporate an increasing number of 
transistors, the margin for error in man¬ 
ufacturing shrinks. 

The federal report cites several 
sources contributing metrology tech¬ 
nology that can help manufacturers 
keep pace with microprocessor re¬ 
search, including federal agencies, uni¬ 
versities, corporations, and cooperative 
research groups. A free copy of the 
report is available from Jane Walters, 
B3444 Technology Building, National 
Institute of Standards and Technology, 
Gaithersburg, MD 20899. 

Germany recycles old 
machines 

A new law will require German 
manufacturers to take back old elec¬ 
tronic equipment, beginning in 1994. 
The Electronic Waste Order aims at re¬ 
ducing the 800,000 metric tons of elec¬ 
tronic waste that reach the country’s 
incinerators and dump sites each year. 
Some manufacturers already take back 
worn-out equipment and dismantle it 


Micro bits 

Computer manufacturer Silicon 
Graphics will acquire Mips Com¬ 
puter Systems. Some analysts see 
the move as an attempt to keep 
the Mips architecture competitive 
in the race to control the brains of 
the next generation of personal 
computers. Meanwhile, Intel, 
whose 386 and 486 microproces¬ 
sors form the basis for the current 
generation of personal computers, 
announced it had signed a letter 
of intent to share technology with 
VLSI Technology. 

Cray Research and Sun Mi¬ 
crosystems will share hardware 
and software technology to cre¬ 
ate a seamless environment for 
Sun’s systems and Cray’s super¬ 
computers. Cray recently formed 
a subsidiary, Cray Research 
Superservers, to make and sell 
Sparc products and joined Sparc 
International, the consortium to 
promote scalable processor archi¬ 
tecture. The supercomputer inno¬ 
vator plans to introduce a 
massively parallel system next year 
with a peak performance of 100 
Gflops. 

The neural networks in Janus 
translate spoken sentences into or 
from English, German, or Japa¬ 
nese. Carnegie Mellon, Siemens 
AG, ATR of Kyoto, and the Uni¬ 
versity of Karlsruhe collaborated 
to build the 400-word, continuous 
speech system. 

The University of New Mexico 
distributes Khoros, a software de¬ 
velopment environment for infor¬ 
mation processing and data 
visualization, at no charge through 
file transfer protocol sites. The distributed
computing system runs
on Sun, DEC, IBM, Hewlett-Packard,
Next, Mips, and Cray
machines and is accessible by e-mail
at pprg.eece.unm.edu.
Micro News

Three join editorial board

Editor-in-Chief Dante Del Corso
has appointed three new members
to the editorial board of IEEE Micro.

Teresa H. Meng is an assistant
professor of electrical engineering
at Stanford University. She will
review manuscripts for the
magazine. Meng earned a BS degree
in electrical engineering at National
Taiwan University and MS and
PhD degrees in electrical engineering
and computer science at the
University of California, Berkeley.

Gilles Privat, a research engineer with
France Telecom, the National Center for
the Study of Telecommunications in
Grenoble, will also review manuscripts.
He heads a research group investigating
areas of parallel algorithms and VLSI
architecture for image processing. Privat
earned engineering and doctoral degrees
in signal and systems theory at Telecom
Paris University.

Arun K. Sood is a professor of computer
science at George Mason University.
He will oversee plans for a new
department that will feature short
technical notes and “dream chips”
submitted by readers. Sood received
a bachelor’s degree from the Indian
Institute of Technology in Delhi
and MS and PhD degrees from Carnegie
Mellon University, all in
electrical engineering. He is a
member of the IEEE Systems, Man,
and Cybernetics Society’s Administrative
Committee and recently guest
edited IEEE Micro’s theme issue on
database machines.





to reuse parts. The new order requires 
them to recycle as much metal and plas¬ 
tic as possible. Precious metals will be 
salvaged and reused; other metals may 
be resmelted and used as slag in, for 
example, road paving. 

Multimedia authoring 
program 

A Stanford University programmer 
has developed software that lets stu¬ 
dents and professors create their own 
multimedia presentations with archive 
and original materials. From a Unix 
workstation, students can combine 
video and music with their own re¬ 
corded commentary and typed-in text. 

George Drapeau of the Academic 
Software Development Group of 
Stanford’s Libraries and Information 
Resources created the program called 
Maestro (multimedia authoring envi¬ 
ronment). His prototype workstation 
includes a Sparcstation 2, microphone, 
laserdisc player, CD-ROM player for 
music and data, and stereo speakers. 

The mouse-driven, icon-based inter¬ 
face allows users to access literary 
works available to on-line users of the 


university’s networks. A video editor 
lets users choose segments, add mu¬ 
sic, record voice commentary, and type 
in text and captions. A time line editor 
designates how segments overlap. 

Drapeau says the program is not designed
to produce professional presentations,
but to serve as a simple tool that
lets users concentrate on the task and
not the computer. He offers the program
as “freeware” to students.

Current literature 

The Glossary of Computer Security 
Technology defines security terms used 
by US federal departments and agen¬ 
cies. The glossary provides multiple 
definitions, reflecting various uses by 
different federal agencies. Technical in¬ 
formation on the 176-page publication 
is available from Edward Roback at 
(301) 975-3696. 

National Technical Information Ser¬ 
vice, Springfield, VA 22161; $26 (hard 
copy), $12.50 (microfiche). 

The five-volume ninth edition of the 
Index and Directory of Industry Stan¬ 
dards lists 113,000 standards from 380 
organizations. Twelve thousand standards
have been revised since the last edition;
8,000 are new. The directory cites standards
by subject, society/numeric, society,
and ANSI concordance. Volumes
1 and 2 comprise the US set; volumes
3, 4, and 5 are the international set.

Global Engineering Documents, PO 
Box 19539, Irvine, CA 92713-9539; 
$376 (complete set), $195 (US volumes
only), $275 (international volumes). 

The Fall 1991 IEEE Standards Cata¬ 
log Update lists electrotechnology stan¬ 
dards published since the institute 
issued its most recent catalog last 
September. 

IEEE Standards, PO Box 1331, Piscataway,
NJ 08855-1331; free.
New Products


Joe Hootman 

University of 
North Dakota 


Devices and components 
Light to frequency 

For precision light measurements, the TSL220 
converts light to digital signals. It comprises a 
photodiode and BiMOS current-to-frequency 
converter and connects directly to a micropro¬ 
cessor or a digital control circuit. The CMOS- 
compatible output voltage is a pulse train with a 
frequency directly proportional to light intensity 
of the diode. 

The device features a dynamic range of 118
dB and output levels of over 100 kHz in office
desk lighting and as low as 1 Hz in the dark.
The converter functions with a 5 to 10V power
supply and in temperature ranges of -25° to 70°C.
It is housed in an eight-pin, clear DIP package. 
Texas Instruments; $4.61 (1,000s). 

Reader Service No. 10 



Texas Instruments' TSL220 


1-Mbit SRAM 

PSM44039 is a processor-specific memory 
(PSM) chip that works as a secondary cache for 
high-end, synchronous RISC processors. The 
1,179,648-bit chip is a self-timed, synchronous, 
CMOS SRAM organized as 128 Kwords by 9 bits. 
Available cycle times range from 15 to 25 ns. 


The PSM chip operates from a 5V supply in temperatures
of 0° to 70°C. Paradigm; $85 (20-ns
version, 1,000s). 

Reader Service No. 11 

10Base-T interface

The SN75LBC086 differential driver/receiver 
is a one-channel interface for concentrators, re¬ 
peaters, and bridges in twisted-pair Ethernet sys¬ 
tems. The 24-pin, 300-mil device features a 
squelch circuit for noise immunity beyond 
10Base-T standard requirements. As also required
by the standard, it provides jabber control, colli¬ 
sion detection, signal quality, and link test func¬ 
tions. A loopback mode permits testing the data 
path while still connected to the network. Texas 
Instruments; $8.40 (1,000s). 

Reader Service No. 12 

One-chip display driver 

For vacuum-fluorescent displays, the M66004 
controller/driver generates 16 characters from 
RAM and 160 characters from ROM, for a vari¬ 
able display length up to 16 display digits in 5 x 
7 segments. Two built-in static points drive LEDs 
or control peripheral ICs. 

A three-line serial bus to the microcontroller 
receives data without needing a buffer. The chip 
also features an eight-step dimmer control, cur¬ 
sor display, and two scan-cycle formats. 
Mitsubishi; $5.12 (1,000s). 

Reader Service No. 13 

Pulse-width modulators 

Designers can build 500-kHz, off-line power
supplies and DC-to-DC converter applications 
with a line of pulse-width modulators. The 
LT1241-1245 modulators feature temperature- 
compensated reference, high-gain error ampli¬ 
fier, current-sensing comparator, 50-ns current 
sense delay, start-up current of less than 250 µA,


Neuron chip 

David Sims, Assistant Editor 

Motorola’s MC143150 Neuron Chip 
is a microcontroller with an embed¬ 
ded protocol that forms the heart of 
remote nodes in networks based on 
Echelon’s Lontalk. Lontalk is a distrib¬ 
uted sense and control network pro¬ 
tocol for industrial, commercial, and 
residential systems that is specially de¬ 
signed to transfer small amounts of 
data, and thus reduce costs. 

Because sense and control networks 
transfer relatively small packets of data 
compared to other network systems 
(for example, “Turn down the heat in 
the board room!” instead of “Here is 
the report on last quarter’s production...”),
they can get by with slower
data transfer rates. Lontalk sends data 
packets of 15 bytes at up to 1.25 Mbps, 
considerably slower than Ethernet’s 10 
Mbps or some of the newer systems 
transmitting up to 150 Mbps. 

According to Al Mouton, strategic
planning manager in Motorola’s 
Lonworks Products organization, the 
slower transfer rate is one way to re¬ 
duce the costs of these systems. An¬ 
other is the on-chip incorporation of 
all the functions needed to process 
inputs from sensors and control de¬ 
vices and respond with commands. 
Each Neuron Chip acts as an “intelli¬ 
gent controller” capable of continu¬ 
ing its monitor and command 
functions even if the network gets dis¬ 
connected. 

Motorola's MC143150 Neuron Chip (figure; recovered labels: data bus (0 to 7), address bus (0 to 15))

Mouton compared the difference
between Lontalk and a centrally
based system to the difference between
desktop PCs and a mainframe
system with “dumb” terminals.

“If a central computer system goes 
down, everyone just sits there star¬ 
ing at blank screens,” he said. “With 
PCs on every desk, if the network 
fails, you can go on working. You 
just can’t transfer data.” 

On-chip features include 

• three 8-bit pipelined processors, 

• an 11-pin I/O port programmable
in 24 modes,

• two 16-bit timer/counters, 

• a five-pin communications port 
to support network transceivers, 

• a 2-Kbyte SRAM, 

• 512 bytes of EEPROM with 
charge pump, 

• an external memory interface, 

• a sleep mode, and 

• a 48-bit ID number unique to 
each device. 


Mouton also said that the on-chip 
incorporation of a protocol reduces 
the cost of each node. Echelon and 
Motorola hope that success for 
Lontalk will make it a de facto stan¬ 
dard for local network command sys¬ 
tems. Nodes within other systems that 
include the software, systems, and 
components to form a complete node 
can cost up to $50 per unit. Motorola 
expects, with volume production, to 
reduce the cost of the Neuron Chip 
to under $5 by 1994. 

Given the unlimited number of 
nodes acceptable within Lontalk’s 
hierarchical architecture, Mouton sees 
a wide range of potential applications, 
from manufacturing lines to home 
automation, from building security to 
systems in a recreational vehicle. 

The Neuron Chip is packaged in a 
64-pin quad flat pack. Motorola; 
$11.78 (1,000s). 

Reader Service No. 14 


and a high-current totem pole output 
stage suited to drive power MOSFETs. 
The chips incorporate blanking in the 
current sense comparator to prevent 
the leading edge current spike from 
prematurely tripping the comparator. 
Linear Technology; $3.74 (1,000s).

Reader Service No. 15 


Software 

Utilities boost DOS machines 

PC-Kwik Power Pak’s utilities speed 
the performance of 286-, 386-, and 486- 
based machines by employing a disk 
cache, screen and keyboard accelera¬ 
tors, and a print spooler. A data buffer 


in RAM boosts application speed. Ac¬ 
celerators speed cursor movement up 
to 126 cps and scrolling up to three 
times faster than standard. The print 
spooler stores data for the printer so 
the system can return to an application.
A disk cache shares memory between
PC-Kwik utilities and lends
memory to application programs as
needed. Other utilities in the set include
a screen blanker that operates on a
timer or with a hot key, and a command-line
editor. Multisoft; $70.

Reader Service No. 16 

Algebra system 

Maple V for AmigaDOS is an interactive
computer algebra system that
delivers 3D PostScript and image file
format output graphics. Its mathematics
library includes more than 2,000
functions, supported by an ARexx port.
The program supports Commodore
Amiga 1000, 2000, 2500, and 3000 on
AmigaDOS version 2.0 or higher
with 2 Mbytes of RAM and 8 Mbytes
of free disk space. Waterloo Maple Software;
$450.

Reader Service No. 17 



Maple V sample output 


Hspice graphic interface 

Graphical Simulation Interface is a 
point-and-click, mouse-driven graphi¬ 
cal user interface that provides inter¬ 
active capabilities for quick analysis of 
Hspice simulations in an X-Windows 
environment. A machine-independent 
file format connects Hspice to GSI so 
users can run Hspice on a mainframe 
and view results graphically on a work¬ 
station. 

GSI also provides concurrent simu¬ 
lation and wave review, point-and-click 
node property and selection, interac¬ 
tive curve measurement, automatic stor¬ 
age of last curve display, and flexible 
viewing with zoom, pan, multiple pan¬ 
els, and multiple simulation data. GSI 
supports most Unix X-Windows work¬ 
stations. Meta-Software; $2,000. 

Reader Service No. 18 


Scientific calculator for Mac 

Micro Math Calc, a desk accessory 
calculator for the Macintosh, offers ad¬ 
vanced features for programmers, math¬ 
ematicians, and engineers. Real, 
complex, and Gaussian numbers can 
carry information on associated units 
and systems. 

Users can enter and view numbers 
in binary, octal, decimal, hexadecimal, 
and with their corresponding ASCII 
characters. The calculator supports 
shifting operations, integer division, 
and logical bitwise operations such as 
Or, Not, And, and Xor. A bits function 
lets users quickly determine binary 
quantities. The calculator complies with 
the Standard Apple Numeric Environ¬ 
ments and IEEE standards. Available 
for System 7, Micro Math Calc requires 
100 Kbytes of memory. Micro Math Scientific
Software; $99.

Reader Service No. 19 



Micro Math Calc 


Terminal emulators 

KEAterm 420, version 2, emulates the
DEC VT420 terminal for DOS machines
running Windows. Emulated functions 
include multiple sessions and pages, 
double high/wide characters, and 132- 
column support. Extensions include 
user-definable keyboard mapping and 
attribute-mapped colors. 

The software supports IBM’s enhanced
keyboard, DEC’s LK250, and
KEA’s Power Station keyboard. Pull-down
menus display in English, French,
or German. Version 2 enhancements 
include network/file transfers and script 
language enhancements. Other features 
in this latest upgrade include interfaces 
for TCP/IP, KEAlink TCP, and Super
Kermit file transfer protocols.

A second product, KEAterm 340, in¬ 
cludes all the capabilities of KEAterm 
420 and supports Regis, Tektronix, and 
sixel graphics. KEA Systems; $245 
(420), $395 (340). 

Reader Service No. 20 


Signal processing 
hardware and software 

20-MHz signal processor 

The 20-MHz ADSP-21020 floating-point
DSP cycles in 50 ns and calculates
a 1,024-point FFT in 0.96 ms. The
manufacturer says its architecture, op¬ 
timized for signal processing applica¬ 
tions, suits it for image processing, 
graphics, radar and sonar, speech rec¬ 
ognition, and advanced audio applica¬ 
tions. The chip comes in commercial 
(0° to +85°C) and military (-55° to
+125°C) temperature ranges, in a 223-lead
pin grid array. Analog Devices;
$198 (1,000s). 
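
The quoted figures can be cross-checked with a little arithmetic. The sketch below assumes one instruction cycle per clock period, which is what "cycles in 50 ns" at 20 MHz implies; the variable names are illustrative only.

```python
# Figures quoted in the announcement.
CLOCK_HZ = 20e6          # 20-MHz clock
FFT_TIME_S = 0.96e-3     # time for one 1,024-point FFT
FFT_POINTS = 1024

# Assumption: one instruction cycle per clock period.
cycle_time_ns = 1e9 / CLOCK_HZ
fft_cycles = FFT_TIME_S * CLOCK_HZ        # cycles spent on one FFT
cycles_per_point = fft_cycles / FFT_POINTS

print(f"cycle time: {cycle_time_ns:.0f} ns")
print(f"cycles per FFT: {fft_cycles:.0f}")
print(f"cycles per point: {cycles_per_point:.2f}")
```

Under that assumption the FFT time corresponds to fewer than 20 cycles per input point.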

Reader Service No. 21 

DSP speeds waveform analysis 

Model 683 is a DSP add-on board 
that extends the processing speed of
Analogic’s Model 6100 Waveform Analyzer
by a factor greater than 300. According
to the company,
the add-on board’s 25-Mflops proces¬ 
sor computes and displays an 8K-point 
FFT in milliseconds and a 16K x 16K 
cross-correlation analysis in one sec¬ 
ond. The speed allows real-time signal 
processing and spectral analysis. Model 
683 uses a 32-bit floating-point DSP 
slaved to Model 6100’s CPU. Analogic;
$2,995. 

Reader Service No. 22 



Analogic's Model 683 and Model 6100 


Smooths data 

Data Smoother processes the scattered 
points of original data and creates a 
waveform while preserving any abrupt 
change in values. Release 2.0 for MS- 
DOS enables users to enter data from 
the keyboard or other ASCII sources. The 
program processes 1,500 points of data in
seconds and displays original and
smoothed data in tables, as an alphanumeric
strip chart, or graphically. Users can
choose from predefined labels or create
their own and save them on disk. Dynacomp;
$50. 
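
The announcement does not say which algorithm Data Smoother uses. As one illustration of smoothing that removes isolated outliers yet keeps an abrupt change in level, the sketch below applies a running median; the function name and sample data are hypothetical.

```python
from statistics import median

def median_smooth(values, window=3):
    """Running median with an odd window; the window shrinks at the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed

# A lone spike (9) followed by a genuine step from 0 to 5.
noisy_step = [0, 0, 9, 0, 0, 5, 5, 5, 5]
print(median_smooth(noisy_step))
```

A moving average over the same window would blur the step across several samples, which is why a median is the usual choice when edges matter.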

Reader Service No. 23 

16-bit input module 

The SBX-416 is an isolated, 16-bit 
analog input module that uses a suc¬ 
cessive approximation analog-to-digital 
converter to process at up to 20 kHz. A
software driver, compatible with Mi¬ 
crosoft and Borland C compilers, gen¬ 
erates a serial data stream. An optically 
isolated external trigger initiates data 
conversion. The module self-calibrates 
on start-up. Systek; $559 (100s). 
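
For readers unfamiliar with successive approximation, the sketch below models an ideal converter: it tests one bit per step, most significant first, and keeps each bit whose trial voltage does not exceed the input. The 10-V reference and unipolar input range are assumptions chosen for illustration, not SBX-416 specifications.

```python
def sar_convert(v_in, v_ref=10.0, bits=16):
    """Return the output code of an ideal successive-approximation search.

    v_ref and the unipolar 0..v_ref range are illustrative assumptions.
    """
    code = 0
    for bit in reversed(range(bits)):       # test the MSB first
        trial = code | (1 << bit)
        # Ideal internal DAC output for the trial code.
        if trial * v_ref / (1 << bits) <= v_in:
            code = trial                    # keep the bit
    return code

code = sar_convert(2.5)
print(code, hex(code))
```

A 16-bit conversion always takes 16 comparisons, so at the quoted 20-kHz rate each comparison has a little over 3 µs available.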

Reader Service No. 24 

Asynchronous servers 

Asynserv2 and Asynserv8 (two- and 
eight-port units) allow LAN users to 
share modems; remote users can ac¬ 
cess the LAN and locally process under 
remote control. The two asynchronous 
communication servers support dial-in 
and dial-out communications at up to 
57.6 Kbps per port, allowing 14.4-Kbps 
V.32bis modems with V.42bis data com¬ 
pression to run at full speed. 
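
The relationship between the port rate and the modem rate can be checked directly. The sketch assumes the commonly quoted best-case V.42bis compression ratio of 4:1; actual ratios depend on the data being sent.

```python
MODEM_RATE_BPS = 14_400   # V.32bis line rate from the announcement
PORT_RATE_BPS = 57_600    # per-port rate from the announcement
MAX_COMPRESSION = 4       # assumed best-case V.42bis ratio

def required_port_rate(line_rate_bps, compression_ratio):
    """Rate the serial port must sustain so the modem is never starved."""
    return line_rate_bps * compression_ratio

needed = required_port_rate(MODEM_RATE_BPS, MAX_COMPRESSION)
print(needed, needed <= PORT_RATE_BPS)
```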

The systems include hardware and 
all necessary software to work with IPX 
or NetBIOS LANs. Other features include
call-back security, host keyboard
locking, screen blanking, multiple file 
transfer capabilities, script language, 
dialing directory, mail, chat mode, and 
pop-up menus. Asynserv measures 38.3 
cm x 8.0 cm x 25 cm and feeds off a 
universal power supply (90 to 260V). 
MNC International; $2,495 (Asynserv2),
$3,895 (Asynserv8).

Reader Service No. 25 


Windows software 

Windows development tool 

Desktop users in multiuser networks 
and client-server environments can 
develop applications in a Windows en¬ 
vironment with Open Insight. It lets 
developers create database applications 
or link to SQL Server, Oracle, or other 
systems to create client-server applica¬ 
tions. Open Insight includes develop¬ 
ment tools and an active data dictionary 
that gives users an integrated view of 
data sources, including dBase, ANSI 
SQL, SQL Server, ASCII, and DB2. 
Quarterly updates, add-on utilities, and 
a year of telephone technical support 
are included. Revelation Technologies; 
$895. 

Reader Service No. 26 

Mac-in-DOS for Windows 

Mac-in-DOS, one of several pro¬ 
grams that copy and convert files be¬ 
tween Macintosh and DOS formats, is 
now available in a Windows format. 
Version 2.0 lets users run the program 
with Microsoft Windows 3.0. Function 
keys perform the main DOS file-han¬ 
dling functions, including changing 
subdirectories and deleting or copying 
files. Pacific Micro; $99. 

Reader Service No. 27 



DOS or Windows applications 

Developers can build Windows or 
DOS applications with APL Plus II/386,
version 4. The program creates APL applications
with a graphical user interface
for the Microsoft Windows 3.0 environment.
Version 4 also interfaces to
non-APL software, including most DOS
software with an application programming
interface for C programmers. It
also includes an interface to Borland’s 
Paradox Database Manager, a screen 
interface toolkit, and Super VGA (800 
x 600) 256-color graphics support. 
STSC; $1,700, $495 (upgrades).

Reader Service No. 28 

Mouse-driven help systems 

Two Windows programs let users
create help systems or data validation 
entry screens for Microsoft Windows 
without writing code, by pointing and 
clicking with a mouse. Robo Help in¬ 
cludes a tool palette that can simplify 
construction of help systems. It generates
source code for indexes, categories,
defined terms, and hypertext files.

Magic Fields lets users develop data 
compilation fields by pointing and 
clicking to predefined data entry fields, 
and adding custom-designed fields. 
Users can specify fonts, colors, and a 
grayed 3D effect. It includes a library 
of numeric, text, alphanumeric, and 
monetary objects. Blue Sky Software; 
$495 (Robo Help), $349 (Magic Fields). 

Reader Service No. 29 

Document management 

DOS users can retrieve or scan word 
processing, database, spreadsheet, or 
graphics files with DE/Cartes Docu¬ 
ment Manager. The icon-based system 
stores files with names up to 64 char¬ 
acters long. Each document can have 
an unlimited number of revisions on¬ 
line. Users can define their own hier¬ 
archy of storage. 

One mouse click selects a file, a sec¬ 
ond accesses a user-defined note, a 
third loads the program. On remote 
systems, access is restricted to one user 
at a time per document, and unautho¬ 
rized users can be locked out. DE/ 
Cartes requires a DOS machine, MS-DOS 3.0,
Microsoft Windows 3.0,
graphics card, mouse, 5 Mbytes of hard 
disk space, and 2 Mbytes of RAM. Desk¬ 
top Engineering International; from 
$147.50 (introductory price).

Reader Service No. 30 
Display and scan
peripherals 

Video array with interface 

Media Wall, an array of monitors with 
a computer interface, is a multimedia 
presentation system integrating com¬ 
puter animation, graphics, and text with 
full motion and still video. It includes 
an adapter card for the Macintosh, a 
satellite control unit containing special- 
effects hardware, and an array of 
stackable video monitors or projectors. 

Users can display duplicate or dif¬ 
ferent images in a variety of modes. 
Or they can display one high-resolution
(3,200 x 2,400) image tiled across
the array. Monitors align up to 27 in a 
line or circle, or in a grid pattern up 
to five by five. RGB Spectrum; $26,000 
(interface board and controller box), 
$40,000 (complete unit with nine 
monitors). Macintosh not included. 
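
To see what tiling means for each monitor, the sketch below splits the 3,200 x 2,400 image evenly across a five-by-five grid, the largest grid mentioned. The announcement does not describe how Media Wall actually maps pixels to monitors, so the even split and the function name are assumptions.

```python
def tile_regions(width, height, cols, rows):
    """Split a width x height image into cols x rows equal rectangles.

    Returns (left, top, right, bottom) tuples in row-major order.
    """
    if width % cols or height % rows:
        raise ValueError("image does not divide evenly across the array")
    tile_w, tile_h = width // cols, height // rows
    return [
        (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
        for r in range(rows)
        for c in range(cols)
    ]

regions = tile_regions(3200, 2400, cols=5, rows=5)
print(len(regions), regions[0], regions[-1])
```

With an even split, each of the 25 monitors shows a 640 x 480 portion of the image.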

Reader Service No. 31 

Pen base for desktops 

Displaypad connects to desktop com¬ 
puters to make a pen-based system. 
Information written or drawn on the 
tablet simultaneously appears on the 
monitor. A cordless stylus senses tip 
pressure, height, and angle. The stylus’ 
resolution is 1,270 lines per inch, and it 
has a data output of 200 points/s. The 
tablet features 640 x 480 resolution, 64 
shades of gray, and a flush surface that 
allows smooth pen operation. 

To install, a display card replaces the 
existing VGA card. Displaypad works 
with DOS, Macintosh, and Sun systems. 
CalComp; $2,500.

Reader Service No. 32 



CalComp's Displaypad


Faster rasters 

Two raster plotters for Macintosh 
applications achieve what the manu¬ 
facturer says are the fastest speeds in 
their class. Model 2400, with a 24”-wide, 
D-size format, plots at 4” per second 
and can format and plot a D-size plot 
in under 40 seconds. Model 3600, with 
a 36”-wide, E-size format, outputs 2” 
per second. The 200-dpi, monochrome 
plotters use Microspot MacPlot DMA
driver software and a NuBus interface
card to process output from Claris CAD,
MacProject, PixelPaint, SuperPaint,
PowerPoint, Freehand, Dreams, and
other Quickdraw programs. Atlantek;
$12,500 (2400), $14,500 (3600). 

Reader Service No. 33 



Atlantek's Model 3600 


Prints 30 pages per minute 

The LC-7030 nonimpact printer out¬ 
puts 30 pages per minute. Two disk 
drives hold fonts; ambitious users can 
replace one with a 52-Mbyte hard disk 
for more storage. The controller supports
HP PCL 5, PostScript, and DEC
LN03 Plus emulations. Connects to a 
printer via RS-232 or RS-422 serial ports, 
Centronics parallel ports, Ethernet, 
TCP/IP, and twinaxial or coaxial cables. 
Advanced Technologies International; 
$16,480 (simplex), $21,430 (duplex). 

Reader Service No. 34 

In-house sign production 

The Image Crafter creates multicol¬ 
ored graphic applications for signage 
and presentations. The package in¬ 
cludes a desktop vinyl cutter/plotter, 
software, and a hand-held scanner. 
Users scan an image into their DOS 
machine (MS-DOS 3.1 or higher). 


There, they can manipulate and clean 
up the image in a window-based pro¬ 
gram, before sending the finished prod¬ 
uct to the plotter. Kroy Sign Systems; 
$2,195. 

Reader Service No. 35 



Kroy's Image Crafter 
Boards

Manufacturer: Brooktrout Technology
Model: TR112 fax card
R.S.#: 80
Comments: PC/AT-compatible Twin Channel board contains two transceivers
for multichannel facsimile applications. By taking advantage of
direct-dialing services, an autorouting version sends incoming messages
automatically to LAN fax-mail server users. $1,995 and $2,495
(autorouter). Quantity discounts available.

Manufacturer: Rohm Corporation
Model: Memory cards
R.S.#: 81
Comments: Credit card-size and smaller SRAM, DRAM, mask ROM, one-time
PROM, and flash-memory cards support applications in which small COB
solid-state units can replace floppy disks. The 32-Kbyte to 6-Mbyte
semicustom products promise 100-ns to 150-ns access times. From $35 to
$950 (100s); 12 weeks ARO.

Manufacturer: Star Tech
Model: 860 Edge add-in
R.S.#: 82
Comments: With 32 to 128 Mbytes of DRAM and math/vector libraries, the
Macintosh II i860 coprocessor accelerates the Pixar Mac Renderman
photorealistic renderer, fitting into one Nubus slot. A developer’s
version supports the large floating-point operations required in
scientific applications. $8,000 (32 Mbytes).

Manufacturer: Traquair Data Systems
Model: HEPC2 parallel processor
R.S.#: 83
Comments: TMS320C40-based, PC/AT-compatible board supports parallel,
image, and digital signal processing, as well as graphics computations.
Supporting up to three TIM-40 TMS daughter boards, HEPC2 provides up to
200 Mflops of floating-point performance (1.1 billion operations/s).
From $1,539 (mother board); from $1,500 (TIM-40s).

Software

Manufacturer: Micro Touch Systems
Model: Power Keypad
R.S.#: 84
Comments: PC Unmouse software lets Microsoft Windows and DOS users
control the cursor and execute macros by touching a keypad marked on a
template inserted beneath the Unmouse glass surface. To click, users
press down on the glass. Free upgrade to Unmouse owners.

Manufacturer: Motorola
Model: Smart Model for 68302
R.S.#: 85
Comments: Behavioral-level simulation model for the 68302 integrated
multiprotocol processor lets designers develop, debug, and optimize the
hardware operation of 302-based designs before committing to physical
prototypes. Smart Model features intelligent error-checking,
user-defined timing, and VHDL interoperability. $4,000 for one-time
technical licensing fee, plus Logic Automation library license fee.

Systems

Manufacturer: Anorad
Model: Anoguide AG-12, AG-14
R.S.#: 86
Comments: Controller- and linear motor-equipped series boosts speeds to
100 ips and up to 120 lbs. of force at a 25-percent duty cycle. For
noncontact operation in an industrial environment, the Anoline
brushless, iron core coil assembly attaches to a moving slide while the
stationary permanent magnets are mounted to the stage’s base plate.

Manufacturer: Ariel
Model: IRCAM workstation
R.S.#: 87
Comments: Designed for compute-intensive applications on the Next Cube,
this signal-processing workstation uses two i860 RISCs combined with a
DSP56001 to perform at 160 Mflops and 93.5 MIPS. The CPOS/FTS real-time
operating system provides protected multitasking kernel, memory
management, file I/O, interprocessor communications, and process
management facilities. $14,995; university discounts available.

Manufacturer: Neotronics/Laser Monitoring Systems
Model: TE-TC temperature controller
R.S.#: 88
Comments: Controller with built-in safety features supports
semiconductor devices that need to be operated at lower than ambient
temperatures. A four-digit, seven-segment LED shows either set or
measured values of temperature, resistance, and current.

Manufacturer: Nighthawk Electronics
Model: DXS-16 data exchanger
R.S.#: 89
Comments: Several computers can share a number of peripherals (printers,
modems, file servers) with the 16-Mbyte DXS-16 as a LAN alternative.
Each unit maintains 500,000-bps speeds on up to 16 serial, parallel, or
3270 coaxial/twin-axial ports.

Manufacturer: PI Systems
Model: Infolio pen computer
R.S.#: 90
Comments: Pen-based, 2.9-lb., 9 x 10 x 1.2-in., PCMCIA-memory computing
tool features a 640 x 480-pixel VGA-quality, reflective LCD. Infolio
integrates a CalComp cordless stylus and Motorola 68331-based hardware
with PDX-framework database software, a graphical user interface, and
task-specific application software for mobile information collection
and management solutions. $1,895.

Miscellany

Manufacturer: I-Con Industries/Antel Corporation
Model: Multiwire backplane
R.S.#: 91
Comments: Futurebus+ backplane in A, B, and F profiles with 64- and
128-bit data widths comes in 5-, 9-, and 14-slot configurations. The
multilayer board uses precision discrete wires rather than etched
circuits for signal interconnection and eliminates signal layers and
the number of corresponding etched voltage and ground planes required
for impedance reference.

Manufacturer: Multiaccess Computing
Model: MCC-1000F adapter
R.S.#: 92
Comments: Frame relay service adapter card provides Macintosh II users
with multiple, presubscribed virtual connections across metropolitan or
wide-area networks. The one-board controller occupies one Nubus I/O
slot on the system board and runs at T1 or fractional T1 rates over a
T1 facility. $2,995; 60 days ARO.

Manufacturer: Parsytec
Model: TIP I/O bus boards
R.S.#: 93
Comments: TIP series expands I/O on transputer-based parallel processing
systems, transmitting data simultaneously, in parallel, to multiple
transputer nodes via its broadcast function. The broadcast rate equals
n x 100 Mbytes/s, where n equals the number of receivers. Series
includes the TIP-VPU/T8 T805 processor board, TIP-MFG monochrome frame
grabber, and the TIP-CGD color graphic display board. From $4,600 each;
3 or 4 weeks ARO.
Coming in June

IEEE Micro’s associative memories and 
processors issue features articles on: 

• A dynamic associative processor for machine
vision applications

• A module for a heterogeneous vision archi¬ 
tecture 

• A pattern-addressable memory 

• And more... 

Also, look for On the Edge’s 
discussion of Object Encyclopedia 
Technology standards. 
Presents


The LiSBUS Async I/O System 


A solution of the 1990s for today’s data transmission problems.

Outstandingly Simple and Reliable because LiSBUS™ is based on a
breakthrough technology which uses the impedance of the bus cable
to replace binary addresses. Consequently, data transmission
management is greatly simplified and much more reliable than
today's equivalent systems which require expensive software,
hardware, and personnel investments.

Outstandingly Practical because it is easy to install and operate.
No special tools, workbench, or electronics expertise are needed.
Anyone can be up and running in minutes. Just plug in the external
modules and configure the system with the user-friendly LiSBUS™
Link Control Software. Each external module measures only around
2 in. by 2 in.

Outstandingly Flexible because a user can connect up to 60
peripherals or computers to a controlling computer through their
RS-232C (COM) ports. To add peripherals, just extend the bus cable
and add modules.

At an Unbeatable Price because at $650* for the LiSBUS™ Starter
Pack, no alternative offers all these advantages combined into one
product without spending a much higher amount. The Starter Pack
includes all the user needs to connect four peripherals and a
complete set of LiSBUS™ Software Development Tools to create
custom applications.

LiSBUS™ Async I/O System: A product of our CommNexus™ line of
communication systems. GIGATEC is committed to offering products
and servicing its customers in the best tradition of Swiss quality. We
provide our customers with:

• Technical Support. Registered buyers can obtain technical support
from our qualified engineers.

• Users and Developers Group. Organized for encouraging software
developments using the products of the CommNexus™ family.

For more information and ordering contact:

In the USA and Canada:

GIGATEC (USA), Inc.
871 Islington Street
P.O. Box 4705
Portsmouth, NH 03802-4705 USA
Toll Free (800) 945-3002 (excl. Hawaii)
Mon.-Fri. 9am - 9pm EST
Tel. (603) 433-2227
Fax (603) 433-5552

In Europe:

GIGATEC SA
Ch. des Plans-Praz
1337 Vallorbe SWITZERLAND
Tel. 41 21 843 37 36
Fax 41 21 843 33 25

* specifications and prices subject to change without prior notification

Visa and MasterCard/EuroCard accepted. CommNexus™ and LiSBUS™ are trademarks of GIGATEC SA.